Should big computers run small calculations?

The Top500 list ranks the most powerful computers in the world.[1] As of this writing, the top spot belongs to Tianhe-2, a Chinese computer capable of 33.86 petaflops (quadrillion computations per second). That’s perhaps a million times the performance of the computer you are viewing this on now.[2] Many of the remaining top 10 computers are located at national laboratories around the United States, including Titan at Oak Ridge National Laboratory and Sequoia at Lawrence Livermore National Laboratory (both reaching 17 petaflops).

The supercomputer rankings are based on the LINPACK benchmark, which is essentially a measure of a computer’s “horsepower”. While there are endless discussions about employing better benchmarks that give a more practical “top speed”, perhaps a more important problem is that many researchers needing large amounts of computing don’t care much about the top speed of their computer. Rather than supercharged dragsters that set land speed records, these scientists require a fleet of small and reliable vans that deliver lots of total mileage over the course of a year.


The fundamental difference between the “drag racers” and the “delivery van operators” is whether the simulations coordinate all the computer processors towards a common task. In combustion studies and climate simulations, scientists typically run a single large computing job that coordinates results between tens or even hundreds of thousands of processors. These projects employ “computer jocks” that are the equivalent of a NASCAR pit crew; they tweak the supercomputers and simulation code to run at peak performance. Meanwhile, many applications in biology and materials science require many small independent simulations – perhaps one calculation per gene sequence or one per material. Each calculation consumes only a modest amount of processing power; however, the total computational time needed for such applications can now rival the bigger simulations because millions of such tiny simulations might be needed. Like running a delivery service, the computer science problems tend to be more about tracking and managing all these computing jobs rather than breaking speed records. And instead of a highly parallel machine tuned for performance, many inexpensive computers would do just fine.

These two modes of computing are sometimes called “high-performance computing” (or HPC) for the large coordinated simulations and “high-throughput computing” (or HTC) for myriad small simulations. They are not always at odds. Just as consumer cars benefit from improvements on the racetrack, today’s high-throughput computing has benefited from designing yesterday’s high-performance systems. And the machines built for high-performance computing can sometimes still be used for high-throughput computing while maintaining low cost per CPU cycle and high energy efficiency. But, fundamental differences prevent the relationship from being completely symbiotic.

Making deliveries in racecars

One issue with large supercomputers concerns drivability – how easy is it to use and program the (often complex) computing architecture? Unfortunately, tweaking the computers to go faster often means making them less flexible. Sometimes, enhancing performance for large simulations prevents small jobs from running at all. The Mira Blue Gene cluster at Argonne Lab (currently the #5 system in the Top500 list) has a minimum partition of over 8000 processors and a low ratio of I/O nodes to compute nodes, meaning the machine is fundamentally built for large computations and doesn’t support thousands of small ones writing files concurrently. These top supercomputing machines are built like Death Stars: they are designed for massive coordinated power towards a single target but can potentially crumble under a load of many small jobs.

You must be this tall to ride the supercomputer
Running high-throughput tasks on large supercomputer queues might require dressing up small jobs as larger ones (the disguise doesn’t always work).

Other times, there are no major technical barriers to running small jobs on large machines. However, “run limits” imposed by the computing centers effectively enforce that only large computing tasks achieve large total computer time. One can bypass this problem by writing code that allows a group of small jobs to masquerade as a big job.[3] It doesn’t always work, and is a crude hack even when it does.
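To make the "masquerade" idea concrete, here is a minimal sketch of how small jobs might be grouped until each bundle is large enough to submit as one big job. The names `SmallJob` and `bundle_jobs`, the 24-core job size, and the 8192-core minimum are all hypothetical illustrations; this is not how FireWorks or any particular queueing system implements it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SmallJob:
    name: str
    cores: int  # cores this job needs on its own


def bundle_jobs(jobs: List[SmallJob], min_cores: int) -> List[List[SmallJob]]:
    """Greedily group small jobs so each bundle requests at least min_cores.

    Jobs left over at the end that cannot reach the minimum are returned
    as a final, undersized bundle so the caller can decide what to do.
    """
    bundles: List[List[SmallJob]] = []
    current: List[SmallJob] = []
    current_cores = 0
    for job in jobs:
        current.append(job)
        current_cores += job.cores
        if current_cores >= min_cores:
            bundles.append(current)
            current, current_cores = [], 0
    if current:
        bundles.append(current)  # leftover bundle, below the minimum size
    return bundles


# Hypothetical workload: 1000 independent calculations of 24 cores each.
jobs = [SmallJob(f"material-{i}", cores=24) for i in range(1000)]
bundles = bundle_jobs(jobs, min_cores=8192)
first_bundle_cores = sum(job.cores for job in bundles[0])
print(len(bundles), "bundles;", first_bundle_cores, "cores in the first")
```

The undersized last bundle is one reason the disguise does not always work: whatever does not add up to a "big" request still has nowhere to run.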

If supercomputers are naturally suited to support large jobs, why not take small high-throughput jobs elsewhere? A major reason is cost. The Materials Project currently consumes approximately 20 million CPU-hours and 50 terabytes of storage a year. This level of computing might cost two million dollars a year[4] on a cloud computing service such as Amazon EC2 but can be obtained essentially for “free” by writing a competitive science grant to use the existing government supercomputers (for example, two of these centers – Argonne and Oak Ridge – have a combined computing budget of 8 billion CPU-hours*).
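The arithmetic behind these figures is easy to check. The short sketch below uses only numbers quoted in this article (including the 2.5 cents per core-hour estimate from footnote 4); the variable names are mine, and the results are back-of-the-envelope values rather than actual price quotes.

```python
# Figures quoted in the article.
cpu_hours_per_year = 20_000_000           # Materials Project usage
cloud_estimate_usd = 2_000_000            # rough yearly cost on a cloud service
doe_rate_usd_per_hour = 0.025             # footnote 4: ~2.5 cents per core-hour
two_center_budget_hours = 8_000_000_000   # Argonne + Oak Ridge combined

# Price per CPU-hour implied by the cloud estimate.
implied_cloud_rate = cloud_estimate_usd / cpu_hours_per_year

# Equivalent cost of the same computing at the DOE rate.
doe_equivalent_usd = cpu_hours_per_year * doe_rate_usd_per_hour

# Share of the two centers' combined budget this project would use.
budget_share = cpu_hours_per_year / two_center_budget_hours

print(f"Implied cloud rate: ${implied_cloud_rate:.2f} per CPU-hour")
print(f"DOE-rate equivalent: ${doe_equivalent_usd:,.0f} per year")
print(f"Share of two-center budget: {budget_share:.2%}")
```

In other words, the cloud estimate works out to about four times the DOE rate, while the whole project amounts to a fraction of a percent of the two centers' combined budget.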

Still, many high-throughput computational scientists have chosen not to employ supercomputers. The Harvard Clean Energy Project, an effort to discover new organic solar cells through millions of small simulations, leverages the personal computers of thousands of citizen volunteers via the IBM World Community Grid. Cycle Computing is a business that maps scientific computing problems to cloud computing services. The National Science Foundation is funding computing centers such as XSEDE that are more accommodating to massive amounts of small jobs. Adopting politically-tinged slogans like “Supercomputing/HPC for the 99 percent”[5] in their press releases, these efforts are actively rebelling against the majority of computer time going to the small minority of computer jocks that can operate the supercomputing behemoths.

Can’t we all just get along?

The HPC/HTC divide is discouraging, especially because large and small jobs are not completely immiscible. The National Energy Research Scientific Computing Center (NERSC) has added a high-throughput queue to its large production cluster, Hopper (a 1 petaflop machine that was at one time the #8 computer on the Top500 list). In theory, this could be of mutual benefit. Supercomputing centers are partly evaluated by their utilization, the average percent of computing cycles that are used for scientific work (versus waiting idly for an appropriate-size task). A small job can fill a gap in computing like a well-shaped Tetris piece, thereby improving utilization. The two communities can certainly cooperate more.[6] HPC and HTC certainly belong to different computing families, but a Montague-Capulet style war may only hurt both sides.
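The Tetris analogy can be sketched in a few lines. The toy function below is a deliberately simplified first-fit gap filler, not a real scheduler: it assumes each idle gap is a block of free nodes for a fixed number of hours, and that a small job placed in a gap holds its nodes for the rest of that gap. The function name and the example numbers are hypothetical.

```python
def fill_gaps(gaps, small_jobs):
    """Place small jobs into idle gaps using first fit.

    gaps: list of (free_nodes, hours) idle windows on the machine.
    small_jobs: list of (nodes, hours) requests.
    Returns the node-hours of otherwise idle time that were put to use.
    Simplification: nodes given to a job are not reused within the same gap.
    """
    remaining = [[nodes, hours] for nodes, hours in gaps]  # mutable copies
    recovered = 0
    for nodes, hours in small_jobs:
        for gap in remaining:
            if nodes <= gap[0] and hours <= gap[1]:
                gap[0] -= nodes          # reserve the nodes inside this gap
                recovered += nodes * hours
                break                    # job placed; move on to the next one
    return recovered


# Two idle windows waiting for an appropriately sized large job.
gaps = [(100, 2), (50, 6)]
# Twenty small jobs, each needing 10 nodes for 1 hour.
small_jobs = [(10, 1)] * 20

idle_node_hours = sum(nodes * hours for nodes, hours in gaps)
recovered = fill_gaps(gaps, small_jobs)
print(f"Recovered {recovered} of {idle_node_hours} idle node-hours")
```

Even this crude version shows the mutual benefit: the small jobs get to run, and the center's utilization figure improves.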

[1] Many people believe that supercomputers owned by government agencies or commercial entities such as Google would win if they decided to enter into the competition.
[2] I’m assuming ~30 gigaflops LINPACK performance for a desktop based on this article in The Register.
[3] The FireWorks software I develop is one workflow code that has this feature built-in.
[4] (9/18/2014) I’ve heard an estimate that computer time at DOE leadership computing facilities costs about 2.5 cents per core-hour, such that 20 million CPU-hours would be equivalent to about $500,000.
[5] For example, see these articles released by San Diego Computing Center and in Ars Technica.
[6] Note: I am part of a queueing subcommittee at NERSC that hopes to start investigating this soon.
* (2/7/2014) The original version of this article stated that the DOE had a total 2013 computing budget of 2.4 billion CPU-hours, which is too low. The updated figure of 8 billion for just two top centers is based on this presentation.

Point-and-Compute: Quantum Mechanics in “Auto” Mode

Ansel Adams is famous for capturing iconic black and white photographs of Yosemite Park and the American West during the early and mid 1900s. In those early days of photography, landscape photographers needed to be hardy. To capture his seminal Monolith image, Ansel climbed 2500 feet of elevation in the snow carrying a large format camera and wooden tripod. In addition to such physical stamina, photographers needed to possess a deep and intuitive mental understanding of how their camera inputs – shutter speed, focal length, film type, and lens filters – achieved the final output – exposure, sharpness, depth of field, and contrast. Ansel had only two of his large glass plate negatives remaining when he reached the spot to photograph the Monolith. Without the luxury of infinite shots or digital previews, he had to previsualize all the complex interactions between the outside light and his camera equipment before opening the shutter.[1]

Today, most photographs at Yosemite are likely taken by cell phones carried in a pocket along well-marked trails. All the decisions of the photographic process – shutter speed, white balance, ISO speed – are made by the camera. After the capture, one can select amongst dozens of predesigned post-processing treatments to get the right “look” – perhaps a high-contrast black and white reminiscent of Adams. Today, one can be a photographer whilst knowing almost nothing about the photographic process. Even today’s professionals offload many of the details of determining exposure and white balance to their cameras, taking advantage of the convenience and opportunities that digital photography offers.

Computational materials science is now following a similar trajectory to that of photography. In the not too distant past, one needed to know both density functional theory (DFT) and the details of its implementation to have any hope of computing an accurate result. Computing time was limited, so each calculation setting was thought through and adjusted manually before “opening the shutter” and performing the calculation.

The new way of computing materials properties is starting to look different than the past
The new generation of materials scientists might compute materials properties differently than their research advisors.

These days, DFT software packages offer useful default settings that work over a wide range of materials. In addition, software “add-ons” are starting to promise essentially “point-and-compute” technology – that is, give the software a material and a desired property and it will figure out all the settings and execute the calculation workflow. Performing a DFT computation is still not as easy as snapping an iPhone picture, but it is becoming more and more accessible to experimental groups and those sitting outside theory department walls.

But, along with these new steps forward in automation, might we be losing something as well?

Image quality

Often one of the first victims of increased convenience is quality of the final product. The large format negative used by Ansel Adams to capture “The Monolith” was likely capable of recording over 500 megapixels of detail,[2] much more than today’s highest end digital camera (and a far cry from Instagram snapshots).

In the case of DFT, both manual and semi-automatic computations employ the same backend software packages, so it is not as if the automatic “cameras” are of poorer quality. Careful manual control of the calculation settings still yields the most precise results. However, algorithms for fully automatic calculation parameters are growing in sophistication and may someday be embraced almost universally, perhaps in the same manner as the automatic white balance feature in cameras.

Post processing

More controversial, perhaps, is the rise of automated post-processing routines to “correct” the raw computed data in areas of known DFT inaccuracy.[3] Such techniques are how cell phone cameras provide good images despite having poor sensors and lenses: post-processing algorithms reduce noise and boost sharpness and color to make the final image look better than the raw data. The danger is potentially overcorrecting and making things worse. Fundamentally-minded researchers (like fans of high quality lenses) would insist that quality originate in the raw data itself. The problem is that employing a quality “computational lens” requires much more computational time and expense, and designing better “lenses” that produce better raw data is a very slow research process. It appears that the use of post-processing to correct for DFT’s shortcomings will only grow while researchers wait for more fundamental improvements to accuracy.

Creative control

To get a different perspective on reality, you may need to fiddle with the camera settings. [Photo by Mark Silva / Flickr]
If you point your camera up at the night sky, open up the aperture, and take a very long exposure (or stack multiple shots) you can reveal the circular trails left by stars through the night sky as the earth rotates. These images are not accurate depictions of the sky at any one moment, but instead expose a different truth of how it rotates with time. To get these images, one must think of the camera as not just a point-and-click device but rather as a multi-purpose tool.

Creative controls also exist for the quantum calculation camera; one can artificially stretch the spacing between atoms past their normal configuration or calculate the properties of materials with magnetic configurations unknown in reality. These calculations are opportunities to predict not only what is, but what might be and see things in a different way. For example, could a battery material be charged faster if we increased the spacing between atoms?[4] Sadly, the “point-and-compute” method does not encourage this type of creative approach; those unfamiliar with the manual controls may think of DFT in a reduced vocabulary.

Organization, sharing, and the democratization of DFT

Perhaps the most unambiguous improvement of the new generation of DFT calculation will be how data is organized and shared. In the past, the results of a DFT study were buried within a maze of custom file folder hierarchies stored on a research group computing cluster that could be accessed by only a dozen people and usually navigated by only the author of the study.

We are starting to see a shift. Researchers are developing software to manage and share calculations, and large global data repositories share tens of thousands of DFT results with thousands of users via the internet (something unimaginable to many in the field only a decade ago). The audience for a once purely theoretical method is greatly expanding.

The changes to DFT calculations are not occurring as quickly or drastically as they did for photography, but they are certainly happening. Like it or not, today’s computational materials scientists will soon be sharing the field with many “point-and-compute” enthusiasts. The old guard, then, must learn to maintain the strengths of the old ways while still taking advantage of all the new opportunities.

[1] For more about the Monolith photograph as well as other photographers (like Cartier-Bresson), see this video series by Riaan de Beer.
[2] Ansel Adams shot “The Monolith” on 6.5×8.5 film which would be roughly equivalent to 550MP according to this article. Of course, whether his lens was sharp enough to capture that level of detail is another matter.
[3] For example, corrections have been developed for gas to solid phase reactions, solution to solid phase reactions, metal/element to compound reactions, and localized versus delocalized compound reactions.
[4] It turns out that increasing atom spacing can make some electrodes charge and discharge faster.

Preparing the ground for mining

The term materials informatics – referring to applying data mining techniques to materials design – has been around for about seven years now. Yet, while the materials informatics approach slowly nucleates in a few institutions, a true revolution in the way most materials are designed has not arrived.

Why aren’t more people data mining materials? Some point to a lack of unified materials data sets to analyze. That situation is changing: as you read this, supercomputers around the country are systematically computing materials properties and the results are being posted openly to the web. Many are optimistic that these new databases will soon kickstart the materials informatics revolution.

However, even as the availability of materials data sets grows, an underestimated obstacle remains to materials informatics: we’ve yet to come up with a good mathematical description of materials that can be leveraged by the data mining community. Even as materials scientists expose acres of materials data for mining, it is as if these data resources are encased in hard rock that mathematicians and computer scientists can’t immediately dig into.

Encoding materials

Because data mining operates on numbers rather than on atoms or electrons, materials scientists must first encode materials in a numerical format for the machine learning community. Encoding materials is nothing new; researchers have always needed to describe materials in numbers (e.g., in research articles and to computational chemistry software). However, the details of the encoding are crucial to the success of the data mining effort. In particular, such encodings should be universal, reversible, and meaningful.

digital material
Encoding materials for the data mining community is an important and perhaps underestimated challenge.


Universality

A good encoding is applicable to the entire space of the objects it hopes to describe. For materials, this problem was solved long ago. Just as an audio recording can digitally represent any piece of music using 0s and 1s, materials scientists can describe any material using three vectors for the repeating unit cell plus additional vectors for the positions and elemental identity of each atom in that cell.


Reversibility

A machine learning technique might be able to describe a potential new material with ideal properties in terms of numbers, but those numbers must have a way to transform back into a complete material. We will refer to this characteristic – that an encoded prediction can be decoded back to an object – as reversibility.

A reversible materials encoding is not difficult to achieve on its own or in conjunction with universality (the vector strategy described previously works just fine). However, both universality and reversibility become problematic when we attempt to define meaningful representations of materials.
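As an illustration of the vector strategy, here is a minimal sketch that stores a compound as three lattice vectors plus a list of atoms, flattens it into plain numbers, and then decodes those numbers back into the same compound. The `Compound` class, the `encode` and `decode` functions, and the tiny element table are all hypothetical, and the lattice values are only roughly those of rock salt.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = Tuple[float, float, float]

# Small illustrative lookup table; a real one would cover the periodic table.
ATOMIC_NUMBER = {"Na": 11, "Si": 14, "Cl": 17, "Fe": 26}
SYMBOL = {number: symbol for symbol, number in ATOMIC_NUMBER.items()}


@dataclass(frozen=True)
class Compound:
    lattice: Tuple[Vector, Vector, Vector]      # three unit cell vectors
    sites: Tuple[Tuple[str, Vector], ...]       # (element, fractional position)


def encode(compound: Compound) -> List[float]:
    """Flatten a compound into numbers: 9 for the cell, then 4 per atom."""
    numbers = [x for vector in compound.lattice for x in vector]
    for symbol, position in compound.sites:
        numbers.append(float(ATOMIC_NUMBER[symbol]))
        numbers.extend(position)
    return numbers


def decode(numbers: List[float]) -> Compound:
    """Rebuild the compound from the flat list produced by encode()."""
    lattice = tuple(tuple(numbers[i:i + 3]) for i in (0, 3, 6))
    sites = []
    for i in range(9, len(numbers), 4):
        symbol = SYMBOL[int(numbers[i])]
        sites.append((symbol, tuple(numbers[i + 1:i + 4])))
    return Compound(lattice, tuple(sites))


# Illustrative two-atom cell (values approximate).
rock_salt = Compound(
    lattice=((0.0, 2.82, 2.82), (2.82, 0.0, 2.82), (2.82, 2.82, 0.0)),
    sites=(("Na", (0.0, 0.0, 0.0)), ("Cl", (0.5, 0.5, 0.5))),
)

encoded = encode(rock_salt)
round_trip = decode(encoded)
print(len(encoded), "numbers; round trip matches:", round_trip == rock_salt)
```

The round trip succeeds for any cell and any list of atoms, which is what makes this encoding universal and reversible. What it lacks, as discussed next, is meaning: nothing in the list of numbers says which entries matter for a given property.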


Meaningfulness

While universality and reversibility are important for representing data, the crucial ingredient for extracting patterns and making predictions is meaningfulness of the encoding. For example, audio encodings such as MP3 are universal and reversible, but not particularly meaningful. How would an algorithm take a raw string of 0s and 1s and attempt to guess its composer or mood? It is not as if a certain binary pattern indicates Beethoven and another indicates Debussy. A machine learning expert given data encoded using formats meant for recording and playback would be hopelessly lost if he or she attempted to directly apply those encodings to machine learning.

universal/reversible vs meaningfulness in music
Unfortunately, as encoding strategies increase in meaning, they tend to become less universal and reversible.

In music, we can convey more meaning if we change our encoding strategy to be a musical score composed of note pitches, note lengths, and dynamic markings. This format lends itself to mathematically detecting note patterns that predict something meaningful such as musical genre or mood,[1] which will in turn vastly improve the success of data mining. However, a traditional musical score is not universal across all music; for example, it struggles to describe microtonal music and electronic music. It is also not fully reversible; as any fan of classical music knows, the score for Beethoven’s Ninth Symphony has been reversed into countless valid musical interpretations by different orchestras.[2] Thus, our quest for meaning has limited us in universality and reversibility.

You might instead encode music using words; a classical music piece might be described as “orchestral” or “small ensemble”, “cheerful” or “somber”, “melodic” or “twelve-tone”. These descriptors are in fact quite meaningful: we would expect that other music with the same or similar descriptors would sound similar, and such descriptors should also correlate with other properties of the music such as the country or time period it was composed in. However, despite being high in meaning, these textual descriptors are not reversible: reading a description of a piece of music does not allow you to faithfully recreate the sound in your head. In addition, given a limited musical vocabulary, text descriptors are also not universal[3] (a vocabulary of classical terms such as “twelve-tone” won’t describe rap, although it might be a more interesting universe to live in if it did).

Thus, the road to meaning often leads us away from universality and reversibility.

A “killer” encoding?

The typical description of a material as a set of vectors satisfies universality and reversibility, but is quite barren in meaning. The more meaningful materials encodings that have been devised, such as graph-based representations of materials or vectors of “chemical descriptors”, tend to be like the musical score and the verbal descriptions of music: they don’t apply to all materials and don’t capture enough detail to allow a machine learning solution to be reversed back into a material. Generating these alternate encodings also requires complex custom analysis code that can assess a material’s topology and determine its features. Therefore, each new materials informatics application requires the development of a new, meaningful data representation and lots of initial programming work before any real data mining can begin.

As materials data continues to pour in, the materials encoding problem will become more visible and might even become the primary limit to the growth of materials informatics. An opportunity exists for the ambitious materials hacker to develop a “killer” encoding: a universal, reversible representation that can serve as a meaningful “first pass” for almost any materials mining problem.

With some work, such an encoding should be possible. For example, the word2vec algorithm is an encoding for data mining that turns words into numbers.[4] The encoding is pretty universal (most words can be turned into numbers) and reversible (numbers can be turned back to words). And it is meaningful; for example, you can use the encoding to measure meaningful distances between words (the word “France” would be closer in distance to the word “Spain” than to “keyboard”). You can even take meaningful sums (sometimes) – for example, you can enter “Justin Bieber – man + woman” and get back an encoding that reverses to “Taylor Swift”.[5] Thus, it is certainly possible to design encodings that score well in universality, reversibility, and meaningfulness.
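To show what "meaningful distance" and "reversible" mean in practice, here is a toy sketch. The three-number vectors below are made up by hand for illustration; they are not output from word2vec, which learns vectors with hundreds of dimensions from large amounts of text. The function names are likewise hypothetical.

```python
import math

# Hand-made toy vectors (NOT real word2vec output).
TOY_VECTORS = {
    "France": (0.9, 0.8, 0.1),
    "Spain": (0.8, 0.9, 0.1),
    "keyboard": (0.1, 0.0, 0.9),
}


def cosine_distance(a, b):
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_word(vector, vocabulary):
    """Reverse an encoding: find the word whose vector is closest."""
    return min(vocabulary, key=lambda word: cosine_distance(vector, vocabulary[word]))


france_to_spain = cosine_distance(TOY_VECTORS["France"], TOY_VECTORS["Spain"])
france_to_keyboard = cosine_distance(TOY_VECTORS["France"], TOY_VECTORS["keyboard"])
print(f"France-Spain distance:    {france_to_spain:.3f}")
print(f"France-keyboard distance: {france_to_keyboard:.3f}")

# A vector that is not exactly any word still decodes to its nearest neighbor.
decoded = nearest_word((0.2, 0.1, 0.8), TOY_VECTORS)
print("Decoded word:", decoded)
```

A materials analogue would need the same two pieces: a distance that tracks chemical similarity, and a way to map an arbitrary predicted vector back to a real compound.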

The word2vec algorithm is not perfect, and neither would be a similar encoding for materials. However, it would enable researchers to get shovels in the ground on a new problem quickly. The machine learning community has spent decades developing powerful and sophisticated digital drills for extracting patterns from data; the materials scientists now need to prepare the soil.

[1] For example, people have described sequences of notes using Markov chains, and then used that information to generate music of a certain style.
[2] There is a nice iOS app from Touch Press dedicated to exploring the different versions of Beethoven’s Ninth.
[3] The Music Genome Project hired musicians to describe much of the world’s music catalog using a vocabulary of about 400 words. Despite being able to power the Pandora music service with this encoding, it was not universal (Latin and classical music were two examples of holdouts).
[4] More about word2vec here.
[5] Try your luck at word additions at

Here be dragons: should you trust a computational prediction?

In a sense, computational materials science performs virtual “measurements” of a material given a description of its structure. While many such computational measurement techniques have been developed, density functional theory (DFT) is now the most popular approach to rapidly screen compounds for technology. DFT solves quantum mechanics equations that determine the behavior of electrons in a material (approximately); these electron interactions ultimately determine the material’s chemical properties. Compared to other theories, DFT strikes a good balance in terms of being accurate, transferable (not needing too many tweaks for different materials), and low in computational cost[1].

Yet, it is all too easy to list shortcomings of density functional theory. One can only model a repeating unit cell containing about 200 atoms; beyond this, the computational cost of using DFT rapidly becomes prohibitive. Thus, with DFT the richness of materials behavior at larger length scales is lost. Even for materials that can be described with so few atoms, DFT calculations are often subject to large inaccuracies. Unfortunately, there is no theory to tell you how accurate or inaccurate a particular DFT calculation will be. Finally, with computational screening you are usually unsure if the materials you designed in a computer can ever be synthesized in the lab.

With all these limitations, how should a materials researcher interpret DFT results?

Materials cartography

One might view the results of density functional theory predictions in the same way he or she might set sail with an old map of the world. That is, as a useful guide that should be taken with a healthy grain of salt. For example, we can draw parallels between Ptolemy’s 2nd century world map and the current state of the art in high-throughput calculations.

In many ways, high-throughput computations give us a map that is not unlike the early maps of the world. [Ptolemy World Map,]
Certain regions of the world are mapped better than others. Ptolemy’s map gets the northern Mediterranean roughly correct, but then starts failing catastrophically in the southern region. The Black Sea is quite accurate; the Caspian Sea is entirely wrong. Similarly, DFT works quite well for metals and many semiconductors and insulators, but starts to become very unreliable for strongly correlated materials (such as superconductors). How much you trust DFT really depends on what region of the “materials world” you are in.

Sometimes, the overall trend is correct, but the fine details are lacking. The shape of the Arabian Peninsula in Ptolemy’s map is approximately correct but doesn’t get the details right. This is often the case for certain properties computed by density functional theory. For example, when optimizing the coordinates of the atoms in a material’s structure, DFT often slightly under-predicts or over-predicts the bond lengths, but usually gets the overall solution remarkably close to that seen in experiments.

Other times, the trend can’t be trusted, but some fine details can still be recovered. The width of the Red Sea on Ptolemy’s map is not represented faithfully, particularly how it widens severely at the bottom. Yet, certain details like the Gulf of Aden can be seen from the map, and the coastline (particularly the eastern one) is not too bad. Similarly, with some materials properties (such as the band structure), DFT often misses the big picture (width of the energy gap between conduction and valence bands) but recovers some useful details (e.g. general shape of the bands on either side, and their overall curvature).

Entire swaths of the world are completely missing. Just as the sailors of Ptolemy’s day had not ventured out to all parts of the world yet, DFT computations have not yet been performed across all potential materials. With high-throughput DFT, we are essentially sending hundreds of ships in every direction to look for entire new materials continents filled with hidden riches. However, we can’t say that the map is complete.[2]

If you are an experimentalist and are going to set sail with a DFT-based map, you’d better have a sense of adventure and realize you might have to correct course based on what you see. You might also want to get some previous guidance about the sketchier areas of the materials world. But, you’d probably rather have the map than rely only on your personal knowledge of the seas.

Slaying the dragons and sea monsters

In earlier days of mapmaking, it was not uncommon to place drawings of monsters over uncharted regions, warning the seafarer to proceed at their own risk (hence the popular phrase, “Here be Dragons”[3]). A similar warning might apply today to experimentalists interpreting DFT maps for excited state properties and strongly correlated materials.

Yet, the DFT version of the materials world is growing closer to the truth with time. While certain “monsters” remain unslain for decades, brave DFT theorists have wounded many of them (and been knighted with a PhD). Over time, as DFT techniques improve, the monsters will start to disappear from DFT maps[4]. And as high-throughput computations set sail for new materials in new directions, more classes of interesting technological materials will become known to us. Our maps are not perfect, but they will continue to improve; in the meantime, experimentalists will have to retain their sense of adventure!

[1] The amount of time needed to compute the properties of a single material depends on the material’s complexity and on the number and type of properties desired. A reasonable range is between 100 CPU-hours (1 day on your laptop) and 10,000 CPU-hours (100 days on your laptop).
[2] For more details on why we can’t compute everything, see the previous post on “The Scale of Materials Design”.
[3] Despite being a great phrase, “Here be dragons” apparently wasn’t used that much on old maps.
[4] For an example of how DFT is progressing into difficult territory, see this recent prediction of a new superconductor.

The Scale of Materials Design

How many potential materials might there be? For those of us trying to compute our way to the next technological breakthrough, it’s important to grasp the scale of the problem. The first step is to define the term material.

The scale of materials

The objects that we interact with every day are composed of about 10^23 atoms. Rather than describe all those atoms, however, materials scientists classify a material’s structure into different length scales.

Materials Scales
Materials exhibit structure on many length scales, each of which influences the overall properties.

At the fundamental scale, solid objects are composed of atoms that repeat in a pattern defined by a unit cell, much like a three-dimensional M.C. Escher tessellation.[1] The positions of atoms in the unit cell constitute the crystal structure, whereas the elemental identity of the atoms defines the chemistry. We will refer to some combination of crystal structure and chemistry as a chemical compound. In other words, a compound is a specific arrangement of atoms with chemical identities (e.g., Si, Na, Fe, etc.) in a repeating box.

A material encompasses much more than a compound, and includes additional structure at the micro- and macro- scales. At the microstructure level, the compound’s unit cells are arranged into a regular pattern within a grain. The properties of the grains and their boundaries as well as defects and impurities within the grain can have a large effect on a material’s properties. Materials can also form composites of multiple compounds at this scale. A large part of materials science concerns engineering microstructure to achieve certain properties without changing the fundamental compound(s). Similarly, the structure of the material at the macro scale – such as its surfaces – can further influence properties.

For the purposes of counting, we will restrict ourselves to counting compounds, not materials. This restriction anticipates that when we screen materials computationally, we will model all 10^23 atoms of the material as one infinitely repeating unit cell. The number of possible materials would be much bigger than our estimation of the number of compounds.

Estimation method I: A googol of compounds in a box

To count compounds, we might arrange atoms in a box representing the unit cell. Let’s position 30 atoms into a 10x10x10 grid of points, selecting one of 50 elements from the periodic table[2] for each of the 30 atoms.

Packing materials in a box
Tossing 30 atoms into a 10x10x10 box using 50 possible elements gives a googol of possibilities!

It’s now straightforward to calculate the number of compound combinations. The number of potential choices of grid points for our 30 atoms is C(1000, 30) (that is, 1000 choose 30), or about 10^57, representing the possible crystal structures. Next, we assign elements to each of those atoms to get chemistry. Since we are independently assigning one of 50 elements to each of 30 atoms, there are 50^30 ways to pick, or about 10^51 possibilities. Multiplying those two numbers together, we obtain a result of about 10^108 different compounds, or over a googol (10^100)!
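
The arithmetic can be checked with a short Python sketch. The variable names are my own; the inputs (a 10x10x10 grid, 30 atoms, 50 elements) come straight from the setup above. Python integers are exact at this size, so nothing overflows.

```python
import math

GRID_POINTS = 10 * 10 * 10  # points in the 10x10x10 grid
N_ATOMS = 30                # atoms placed in the box
N_ELEMENTS = 50             # candidate elements per atom

# Crystal structures: which 30 of the 1000 grid points are occupied.
structures = math.comb(GRID_POINTS, N_ATOMS)

# Chemistries: an independent choice of element for each of the 30 atoms.
chemistries = N_ELEMENTS ** N_ATOMS

compounds = structures * chemistries
googol = 10 ** 100

print(f"structures  ~ 10^{math.log10(structures):.1f}")
print(f"chemistries ~ 10^{math.log10(chemistries):.1f}")
print(f"compounds   ~ 10^{math.log10(compounds):.1f}")
print("more than a googol:", compounds > googol)
```

Note that this count treats every choice of grid points as a distinct structure, even ones that are just shifted or rotated copies of each other, which is in keeping with the rough spirit of the estimate.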

A googol is a truly unimaginable number. It is not the number of grains of sand on all the beaches on earth (~10^21), or the number of hair widths to the sun (~10^16). It is even more than the number of known atoms in the universe (~10^80). There is no way we can compute all the combinations of 30 atoms on a 10x10x10 grid.

Estimation method II: Modify the known crystal prototypes

You wouldn’t estimate the space of potential 100,000-word novels by counting all the possibilities for words randomly strewn across pages.[3] Like a novel, a legitimate chemical compound obeys rules and patterns[4][5]. A more reasonable guess at the number of compounds might apply a few rules:

  • Rather than be arranged arbitrarily on grid points, the atoms in compounds tend to recur in the same crystal structures. New materials are very likely to reuse an old arrangement of atoms, but with different elements substituted in.
  • Known compounds, or at least those that are not nightmarish to synthesize, tend to be composed of only a few distinct elements, generally 5 or fewer per compound.

Applying these rules completely changes our estimation. As of this writing, the Inorganic Crystal Structure Database has classified about 6500 crystal structure prototypes, or known arrangements of atoms in a box. In terms of chemistry, if we choose 5-element compounds from a set of 50 elements, that’s 50^5 or about 312 million combinations per prototype[6]. Multiplying those numbers together, we get about 2 trillion compounds.
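
As a quick check of this second estimate, here is the same multiplication in Python. The constant names are my own; the values are the ones quoted above.

```python
N_PROTOTYPES = 6500         # crystal structure prototypes quoted above
N_ELEMENTS = 50             # candidate elements
ELEMENTS_PER_COMPOUND = 5   # element slots filled in each prototype

# Each of the 5 slots independently takes one of 50 elements.
chemistries_per_prototype = N_ELEMENTS ** ELEMENTS_PER_COMPOUND
compounds = N_PROTOTYPES * chemistries_per_prototype

print(f"{chemistries_per_prototype:,} chemistries per prototype")
print(f"{compounds:,} compounds in total")
```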

At this point, we could start splitting hairs. Certainly, we don’t know all the crystal structure prototypes yet. However, even if we’ve found only 1% of crystal structures so far, we are only changing our final number by a factor of 100 (to 200 trillion). The story is similar if we want to try more than 50 elements or compounds with more than 5 elements. We could instead go the other way, and add further heuristic rules that reduce the estimate to below a trillion. The order of trillions of possible permutations is perhaps a reasonable middle ground, keeping in mind that our goal is just a basic estimation.
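
To see how little these adjustments move the order of magnitude, the estimate can be wrapped in a small function and re-run under different assumptions. This is only a sketch: the function name and the fraction_known parameter are my own, and the 80-element and 6-element scenarios are hypothetical values chosen for illustration, not figures from any database.

```python
def estimate_compounds(n_prototypes=6500, n_elements=50,
                       elements_per_compound=5, fraction_known=1.0):
    """Prototype-based estimate: prototypes x element assignments.

    fraction_known is the assumed share of all crystal structure
    prototypes that have been discovered so far.
    """
    total_prototypes = n_prototypes / fraction_known
    return total_prototypes * n_elements ** elements_per_compound


baseline = estimate_compounds()
scenarios = {
    "baseline": baseline,
    "only 1% of prototypes known": estimate_compounds(fraction_known=0.01),
    "80 elements (hypothetical)": estimate_compounds(n_elements=80),
    "6-element compounds (hypothetical)": estimate_compounds(elements_per_compound=6),
}

for label, value in scenarios.items():
    print(f"{label:36s} {value:.1e}  (~{value / baseline:.0f}x baseline)")
```

Each scenario shifts the total by one or two orders of magnitude, which is small next to the gap between trillions and a googol.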

Narrowing it down

The current state of the art in high-throughput computations can test about 10,000 to 100,000 materials candidates, depending on the complexity of the materials involved. So, even materials scientists armed with supercomputers must first narrow about 1,000,000,000,000 possibilities down to 10,000 using chemical intuition[7]. Computations can narrow down that list to maybe 100. Even fewer of those will see experimental success, and fewer still will be commercialized.

Our estimate for the number of chemical compounds is not too far from the number of humans that have ever lived (~100 billion). Identifying the handful of technologically relevant compounds, then, is truly like picking out the Da Vincis, Shakespeares, and Beethovens amongst the materials world.

[1] Many materials do not fall neatly into this description, in particular polymers and glasses. (11/20/2013)
[2] Although there are more than 100 elements on the periodic table, many of them are rare, radioactive, synthetic, or extremely expensive. 50 is a nice round number to use.
[3] However, Jorge Luis Borges already performed a very similar thought experiment in a short story titled “The Library of Babel” (4/25/2014)
[4] For example, structures tend to display symmetry; symmetric configurations tend to form an extremum in the crystal structure energy landscape, which is often a minimum-energy point.
[5] For examples of rules exhibited by novels, check out this wonderful article by Brian Hayes on early work by Markov. (11/26/2013)
[6] Not all prototypes support 5 elements, but we are giving them the benefit of the doubt…
[7] Hopefully, informatics will assist with this process in the future.

Why hack materials?

Although we are surrounded by the products of materials science, few of us consider the history and design of the material world. How does sand (SiO2) transform into the silicon-wafer CPUs that beat you at chess?[1] This journey is not only interesting but also important. Most technologies are in some sense limited by the materials from which they are composed.

solar panel figure
A solar panel on your roof, made up of repeating patterns of silicon atoms that give rise to a band structure. The band structure helps estimate light capture properties.

Computer chips are not the only technology employing the silicon in sand. Solar panels on a roof are likely composed of silicon atoms arranged in a repeating configuration. Those silicon atoms – arranged in just that way – give rise to fundamental materials properties like the band structure. Materials scientists relate such fundamental properties to how efficiently a material can capture light and overall device performance. What other arrangements of atoms could we put in solar panels, how would they perform, and could they lead to an energy revolution?

The needle in the haystack

Unfortunately, discovering the right combination of atoms for a technological application is a “needle in the haystack” problem; the vast majority of atom combinations are not only technologically uninteresting but also impossible to synthesize. Materials scientists and chemists therefore rely on knowledge and insight to guide the search for the few technologically relevant materials. Even so, this process generally results in months or years of dead ends. Like the needle in the haystack, new materials are not easily found, and breakthroughs are rare.

Digging in the haystack

It’s difficult to find the next breakthrough material by simply testing the “haystack” of candidates. [Original image source unknown]

How might we find the “needle in the haystack” more quickly and reliably? One strategy might be to increase manpower. An analogue of this idea in the materials world is high-throughput computing. In this technique, some of the world’s most powerful supercomputers predict the properties of tens of thousands of materials by solving physical equations. The best results from the computers are retained for synthesis. Much of the “sorting through the haystack” is performed by machine, while scientists target the most promising candidates.

Army in the haystack
Hiring an army is one approach to search large haystacks faster. Replace the humans with CPUs and you have high-throughput computing. [Original source unknown, Iraqi National Guard]

Finally, might we do even better than an army digging through haystacks for needles? Indeed, even with the world’s most powerful computers, the fact remains that the number of possible materials combinations is more than we can ever compute. That’s where materials informatics might play a role – to equip the army of CPUs with metal detectors, or rather to focus computational power towards the chemical spaces most likely to yield breakthroughs. As more materials are computed, a materials informatics approach would learn which materials are likely to be successful and adapt the search accordingly. These insights could also be passed on to human researchers, who might otherwise never discover these chemical design rules.

Metal Detector Haystack
Materials informatics is like a metal detector for finding new materials. [U.S. Marine Corps photo by Lance Cpl. James Purschwitz/Released]

It is an exciting time for materials design, one in which computations are starting to play a greater role in guiding new discoveries. Perhaps soon, there will not only be materials scientists, but also materials hackers that apply skills from the information revolution to the hunt for materials.

[1] If you’re curious about “sand to CPU”, Intel has produced an eccentric video of the process and TechRadar has a nice article about it.
