I’ve been at the annual High Polymer Research Group meeting at Pott Shrigley this week; this year it had the very timely theme “Polymers in the age of data”. Some great talks have really brought home to me both the promise of machine learning and laboratory automation in polymer science and some of the practical barriers. Given the general interest in accelerated materials discovery using artificial intelligence, it’s interesting to focus on this specific class of materials to get a sense of the promise – and the pitfalls – of these techniques.
Debra Audus, from the USA’s National Institute of Standards and Technology, started the meeting off with a great talk on how to use machine learning to make predictions of polymer properties given information about molecular structure. She described three difficulties for machine learning – availability of enough reliable data, the problem of extrapolation outside the parameter space of the training set, and the problem of explainability.
A striking feature of Debra’s talk for me was its exploration of the interaction between old-fashioned theory and new-fangled machine learning (ML). This goes in two directions. On the one hand, Debra demonstrated that incorporating knowledge from theory can greatly speed up the training of an ML model, as well as improving its ability to extrapolate beyond the training set. On the other hand, a trained ML model is essentially a black box of weights for your neural network, and Debra emphasised the value of symbolic regression to convert that black box into a closed-form expression built from simple functional forms, of the kind a theorist would hope to be able to derive from physical principles. This provides something a scientist might recognise as an explanation of the regularities that the machine learning model encapsulates.
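To give a flavour of what symbolic regression is after, here is a deliberately tiny sketch. Real symbolic regression searches a large space of expression trees; this toy version only picks the best of four fixed functional forms, and it fits them directly to data rather than to a trained neural network. The data are synthetic, generated from a power law whose exponent (0.6) is an arbitrary illustrative choice, and all function names are hypothetical.

```python
import math

def linear_fit(us, vs):
    """Ordinary least squares for v = a + b*u; returns (a, b)."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    sxx = sum((u - mu) ** 2 for u in us)
    sxy = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    b = sxy / sxx
    return mv - b * mu, b

def identity(z):
    return z

# Each candidate: (label, transform of x, transform of y, inverse transform of y).
# Fitting a straight line in the transformed variables fits the labelled form.
CANDIDATES = [
    ("y = a + b*x", identity, identity, identity),
    ("y = a + b*ln(x)", math.log, identity, identity),
    ("y = A*exp(b*x)", identity, math.log, math.exp),
    ("y = A*x**b", math.log, math.log, math.exp),
]

def best_form(xs, ys):
    """Return (squared error, label, b) for the best-fitting candidate."""
    results = []
    for label, tx, ty, inv in CANDIDATES:
        a, b = linear_fit([tx(x) for x in xs], [ty(y) for y in ys])
        predicted = [inv(a + b * tx(x)) for x in xs]
        sse = sum((p - y) ** 2 for p, y in zip(predicted, ys))
        results.append((sse, label, b))
    return min(results)

# Synthetic "measurements" from a power law (illustrative, not data from the talk).
chain_lengths = [10, 20, 50, 100, 200, 500, 1000]
sizes = [2.0 * n ** 0.6 for n in chain_lengths]
error, form, exponent = best_form(chain_lengths, sizes)
print(form, round(exponent, 3))
```

On this noise-free data the power-law form wins and the fitted exponent comes back as the one used to generate the data, which is the kind of compact, recognisable expression a theorist could then try to explain.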
But any machine learning model needs data – lots of data – so where does that data come from? One answer is to look at the records of experiments done in the past – the huge corpus of experimental data contained within the scientific literature. Jacqui Cole from Cambridge has developed software to extract numerical data and chemical reaction schemes, and to analyse images, from the scientific literature. For specific classes of (non-polymeric) materials she’s been able to create data sets with thousands of entries, using automated natural language processing to extract some of the contextual information that makes the data useful. Jacqui conceded that polymeric materials are particularly challenging for this approach; they have complex properties that are difficult to pin down to a single number, and what to the outsider may seem to be a single material (polyethylene for example) may actually be a category that encompasses molecules with a wide variety of subtle variations arising from different synthesis methods and reaction conditions. And Debra and Jacqui shared some sighs of exasperation at the horribly inconsistent naming conventions used by polymer science researchers.
My suspicion on this (informed a little by the outcomes of a large-scale collaboration with a multinational materials company that I’ve been part of over the last five years) is that the limitations of existing data sets mean that the full potential of machine learning will only be unlocked by the production of new, large-scale datasets designed specifically for the problem in hand. For most functional materials the parameter space to be explored is vast and multidimensional, so considerable thought needs to be given to how best to sample this parameter space to provide the training data that a good machine learning model needs. In some circumstances theory can help here – Kim Jelfs from Imperial described an approach where the outputs from very sophisticated, compute-intensive theoretical models were used to train an ML model that could then interpolate properties at much lower compute cost. But we will always need to connect to the physical world and make some stuff.
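The idea of training a cheap model on the outputs of an expensive one can be sketched in a few lines. This is only an illustration under loud assumptions: `expensive_model` is a made-up one-dimensional function standing in for a costly simulation, and the surrogate is a simple nearest-neighbour regressor rather than whatever model was used in the work described.

```python
import math

def expensive_model(x):
    """Placeholder for a costly simulation (hypothetical closed form)."""
    return math.sin(3.0 * x) + x ** 2

def fit_surrogate(xs, ys, k=2):
    """k-nearest-neighbour regressor with inverse-distance weighting."""
    data = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(data, key=lambda p: abs(p[0] - x))[:k]
        if nearest[0][0] == x:  # exact hit on a training point
            return nearest[0][1]
        weights = [1.0 / abs(px - x) for px, _ in nearest]
        total = sum(w * py for w, (_, py) in zip(weights, nearest))
        return total / sum(weights)

    return predict

# Run the expensive model only on a coarse grid of 21 points...
train_x = [i / 20 for i in range(21)]
train_y = [expensive_model(x) for x in train_x]
surrogate = fit_surrogate(train_x, train_y)

# ...then query the cheap surrogate anywhere in between.
x_new = 0.333
print(surrogate(x_new), expensive_model(x_new))
```

The surrogate is only trustworthy inside the range it was trained on, which is the interpolation-versus-extrapolation caveat again.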
This means we will need automated chemical synthesis – the ability to synthesise many different materials with systematic variation of the reactants and reaction conditions, and then rapidly determine the properties of this library of materials. How do you automate a synthetic chemistry lab? Currently, a synthesis laboratory consists of a human measuring out materials, setting up the right reaction conditions, then analysing and purifying the products, finally determining their properties. There’s a fundamental choice here – you can automate the glassware, or automate the researcher. In the UK, Lee Cronin at Glasgow (not at the meeting) has been a pioneer of the former approach, while Andy Cooper at Liverpool has championed the latter. Andy’s approach involves using commercial industrial robots to carry out the tasks a human researcher would do, while using minimally adapted synthesis and analytical equipment. His argument in favour of this approach is essentially an economic one – the world market for general purpose industrial robots is huge, leading to substantial falls in price, while custom built automated chemistry labs represent a smaller market, so one should expect slower progress and higher prices.
Some aspects of automating the equipment are already commercially available. Automatic liquid handling systems are widely available, allowing one, for example, to pipette reactants into multiwell plates, so if one’s synthesis isn’t sensitive to air one can use this approach to do combinatorial chemistry. Adam Gormley from Rutgers described this approach, using an oxygen-tolerant adaptation of reversible addition−fragmentation chain-transfer polymerisation (RAFT) to produce libraries of copolymers with varying molecular weight and composition. Another approach uses flow chemistry, in which reactions take place not in a fixed piece of glassware, but as the solvents containing the reactants travel down pipes, as described by Tanja Junkers from Monash, and Nick Warren from Leeds. This approach allows in-line reaction monitoring, so it’s possible to build in a feedback loop, adjusting the ingredients and reaction conditions on the fly in response to what is being produced.
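The feedback loop can be illustrated with a toy controller. Everything here is an assumption made for the sketch: the reaction follows made-up first-order kinetics, the in-line measurement is noise-free, and the only knob is residence time, adjusted by a simple proportional rule. Real flow platforms are considerably more involved.

```python
import math

RATE_CONSTANT = 0.1  # per minute; hypothetical first-order kinetics

def measure_conversion(residence_time):
    """Stand-in for an in-line measurement (assumed noise-free)."""
    return 1.0 - math.exp(-RATE_CONSTANT * residence_time)

def tune_residence_time(target, start=5.0, gain=20.0, steps=30):
    """Proportional feedback: lengthen the residence time when conversion
    is below target, shorten it when above."""
    t = start
    for _ in range(steps):
        error = target - measure_conversion(t)
        t = max(0.0, t + gain * error)
    return t

tuned = tune_residence_time(target=0.8)
print(round(tuned, 2), round(measure_conversion(tuned), 3))
```

The gain is hand-picked so that the loop settles for these particular kinetics; too large a gain would make the residence time oscillate rather than converge.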
It seems to me, as a non-chemist, that there is still a lot of specific work to be done to adapt the automation approach to any particular synthetic method, so we are still some way from a universal synthesis machine. Andy Cooper’s talk title perhaps alluded to this: “The mobile robotic polymer chemist: nice, but does it do RAFT?” This may be a chemist’s joke.
But whatever approach one adopts to produce a library of molecules with different characteristics and analyse their properties, there remains the question of how to sample what is likely to be a huge parameter space in order to provide the most effective training set for machine learning. We were reminded by the odd heckle from a very distinguished industrial scientist in the audience that there is a very classical body of theory to underpin this kind of experimental strategy – the Design of Experiments methodology. In these approaches, one selects the set of parameter combinations that spans parameter space most effectively for a given number of experiments.
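Two of the most classical designs make the idea concrete. The sketch below builds a full factorial design (every combination of levels) and a two-level half-fraction, in which the last factor is set to the product of the others so that half as many runs are needed. The factor names and levels are hypothetical, chosen only for illustration.

```python
from itertools import product

def full_factorial(levels):
    """All combinations of the given levels: one run per combination."""
    names = list(levels)
    return [dict(zip(names, combo)) for combo in product(*levels.values())]

def half_fraction(names):
    """Two-level half-fraction in coded units (-1/+1): the last factor is
    set to the product of the others, halving the number of runs."""
    *base, last = names
    runs = []
    for combo in product((-1, 1), repeat=len(base)):
        sign = 1
        for c in combo:
            sign *= c
        runs.append({**dict(zip(base, combo)), last: sign})
    return runs

# Hypothetical factors and levels, for illustration only.
LEVELS = {
    "temperature_C": [60, 80],
    "monomer_fraction": [0.25, 0.75],
    "initiator_mM": [1.0, 5.0],
}
full = full_factorial(LEVELS)       # 2 * 2 * 2 = 8 runs
half = half_fraction(list(LEVELS))  # 4 runs, in coded -1/+1 units

for run in half:
    print(run)
```

The half-fraction stays balanced (each factor sits at its low and high level equally often), at the cost of confounding the third factor with the interaction of the first two.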
But an automated laboratory offers the possibility of adapting the sampling strategy in response to the results as one gets them. Kim Jelfs set out the possible approaches very clearly. You can take the brute force approach, and just calculate everything – but this is usually prohibitively expensive in compute. You can use an evolutionary algorithm, using mutation and crossover steps to find a way through parameter space that optimises the output. Bayesian optimisation is popular, and generative models can be useful for taking a few more random leaps. Whatever the details, there needs to be a balance between optimisation and exploration – between taking a good formulation and making it better, and searching widely across parameter space for a possibly unexpected set of conditions that provides a step-change in the properties one is looking for.
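As a minimal sketch of the evolutionary option, assuming a made-up objective in place of a real experiment or simulation: keeping the best half of each generation is the optimisation step, while crossover and random mutation keep exploring. The function names, population size and mutation scale are all illustrative choices.

```python
import random

def measure(x):
    """Stand-in for an experiment or simulation (hypothetical objective):
    the score peaks at 0 when every parameter equals 0.7."""
    return -sum((xi - 0.7) ** 2 for xi in x)

def evolve(n_params=3, pop_size=20, generations=30, mutation=0.1, seed=0):
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_params)] for _ in range(pop_size)]
    history = []  # best score at the start of each generation
    for _ in range(generations):
        pop.sort(key=measure, reverse=True)
        history.append(measure(pop[0]))
        parents = pop[: pop_size // 2]  # optimisation: keep the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: each parameter is inherited from one parent.
            child = [rng.choice(pair) for pair in zip(a, b)]
            # Mutation: Gaussian noise, clipped to [0, 1], keeps exploring.
            child = [min(1.0, max(0.0, c + rng.gauss(0.0, mutation)))
                     for c in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=measure), history

best, history = evolve()
print([round(x, 2) for x in best], round(measure(best), 4))
```

Because the parents survive unchanged, the best score never gets worse from one generation to the next. Raising `mutation` shifts the balance towards exploration; lowering it favours refining what is already good.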
It’s this combination of automated chemical synthesis and analysis, with algorithms for directing a search through parameter space, that some people call a “self-driving lab”. I think the progress we’re seeing now suggests that this isn’t an unrealistic aspiration. My somewhat tentative conclusions from all this:
- We’re still a long way from an automated lab that can flexibly handle many different types of chemistry, so for a while it’s going to be a question of designing specific set-ups for particular synthetic problems (though of course there will be a lot of transferable learning).
- There is still a lot of craft in designing algorithms to search parameter space effectively.
- Theory still has its uses, both in accelerating the training of machine learning models, and in providing satisfactory explanations of their output.
- It’s going to take significant effort, computing resource and money to develop these methods further, so it’s going to be important to select use cases where the value of an optimised molecule makes the investment worthwhile. Amongst the applications discussed in the meeting were drug excipients, membranes for gas separation, fuel cells and batteries, and optoelectronic polymers.
- Finally, the physical world matters – there’s value in the existing scientific literature, but it’s not going to be enough just to process words and text; for artificial intelligence to fulfil its promise for accelerating materials discovery you need to make stuff and test its properties.