Hierarchical Reinforcement Learning Advances Flowsheet Design
Over recent decades, machine learning has transformed numerous engineering disciplines, including process engineering. Artificial neural networks have proven effective as surrogate models for simulations and as predictors of thermodynamic properties, but reinforcement learning (RL) offers a distinct advantage for creative design tasks where neither comprehensive datasets nor mechanistic models exist. RL has been widely adopted in process control, yet its application to process synthesis remains comparatively rare.

The work described here builds on the SynGameZero framework, a method that recasts flowsheet synthesis as a competitive, turn-based two-player game. In this setup, each player receives identical feed streams and seeks to construct a more profitable process flowsheet than the opponent, adding unit operations step-by-step. A predefined cost function determines the winner. The original approach combined an artificial neural network with Monte Carlo tree search, enabling the agent to learn flowsheet construction strategies entirely through self-play.
Earlier implementations faced scalability challenges when the design space expanded to include recycles or continuous process parameters. The flat action space—where location, unit type, and specifications were chosen simultaneously—grew exponentially with complexity. To address this, the current work integrates hierarchical reinforcement learning (HRL), structuring decisions into three levels: first, selecting a location or terminating; second, choosing the unit operation; and third, specifying unit parameters when required.
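The three decision levels can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the unit names, the parameter grids in `UNIT_PARAMS`, and all helper functions are hypothetical, and uniform random sampling stands in for the learned policies. The two counting helpers indicate why factoring the decision helps: a flat policy needs one output per (location, unit, parameter) combination, whereas the hierarchy needs only the sum of the three level sizes.

```python
import random
from dataclasses import dataclass
from typing import Optional

# Hypothetical unit catalogue: unit name -> discrete parameter grid.
# An empty grid means the unit needs no level-3 decision.
UNIT_PARAMS = {
    "reactor": [],
    "distillation_column": [0.2, 0.5, 0.8],
    "mixer": [],
}


@dataclass
class Action:
    location: Optional[int]            # index of an open stream, None = terminate
    unit: Optional[str] = None         # chosen unit operation
    parameter: Optional[float] = None  # unit specification, if the unit needs one


def sample_hierarchical_action(n_open_streams: int, rng: random.Random) -> Action:
    """Sample one action level by level (random stand-in for learned policies)."""
    # Level 1: pick an open stream, or terminate (encoded as the extra index).
    choice = rng.randrange(n_open_streams + 1)
    if choice == n_open_streams:
        return Action(location=None)
    # Level 2: pick the unit operation to place at that location.
    unit = rng.choice(sorted(UNIT_PARAMS))
    # Level 3: pick a parameter only when the unit requires one.
    grid = UNIT_PARAMS[unit]
    parameter = rng.choice(grid) if grid else None
    return Action(location=choice, unit=unit, parameter=parameter)


def flat_action_count(n_open_streams: int) -> int:
    """Outputs needed if location, unit and parameter are chosen at once."""
    per_location = sum(max(len(grid), 1) for grid in UNIT_PARAMS.values())
    return n_open_streams * per_location + 1  # +1 for "terminate"


def hierarchical_output_count(n_open_streams: int) -> int:
    """Outputs needed when each level has its own policy head."""
    level_1 = n_open_streams + 1
    level_2 = len(UNIT_PARAMS)
    level_3 = max(len(grid) for grid in UNIT_PARAMS.values())
    return level_1 + level_2 + level_3
```

With four open streams, this toy catalogue gives `flat_action_count(4) == 21` and `hierarchical_output_count(4) == 11`; the gap widens as more units, parameter values, or open streams are added.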
The agent architecture combines multiple actor-critic networks (ACNs) with a convolutional neural network (CNN) for state encoding, feeding into a tree search for forward planning. The CNN processes flowsheet matrices, while each ACN outputs a probability distribution for possible actions and an estimated reward. Infeasible actions are filtered before entering the tree search, which expands nodes by simulating the impact of chosen actions. The ε-greedy policy guides exploration, with ε fixed at 0.1, and the search depth parameter K set to 30.
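Filtering infeasible actions and ε-greedy exploration can be combined in a few lines. The sketch below shows the generic textbook form of masked ε-greedy selection; the function name and the list-based inputs are assumptions made for illustration, and the way the selection is embedded in the authors' tree search may differ.

```python
import random
from typing import Sequence


def masked_epsilon_greedy(
    probs: Sequence[float],
    feasible: Sequence[bool],
    epsilon: float = 0.1,
    rng=random,
) -> int:
    """Return an action index, considering feasible actions only.

    probs    -- action probabilities, e.g. from an actor network
    feasible -- mask marking which actions are allowed in the current state
    epsilon  -- probability of picking a random feasible action instead
                of the most probable one (0.1 as in the text)
    """
    candidates = [i for i, ok in enumerate(feasible) if ok]
    if not candidates:
        raise ValueError("no feasible action available")
    if rng.random() < epsilon:
        # Explore: uniform choice among the feasible actions.
        return rng.choice(candidates)
    # Exploit: most probable feasible action.
    return max(candidates, key=lambda i: probs[i])
```

Because the mask is applied before the choice is made, an infeasible action is never returned, even when the network assigns it the highest probability.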
Training occurs within a Python-based sequential-modular flowsheet simulator, incorporating tear streams for recycles. The process example centers on ethyl tert-butyl ether (ETBE) synthesis from ethanol and a mixture of isobutene and n-butane. Available unit operations include reactors, two types of distillation columns, mixers, and recycles. The reactor model assumes equilibrium at 50 °C, while distillation models use infinite-stage, total-reflux analysis at 8 bar, accounting for azeotropic boundaries in the quaternary system.
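In a sequential-modular simulator, a recycle is typically handled by "tearing" the loop: the recycle stream is guessed, the units are evaluated in order, and the guess is updated until it stops changing. The toy example below illustrates this with successive substitution on a single scalar flow. The fixed conversion, the split fraction, and the function name are hypothetical simplifications; the simulator described here works with multicomponent streams and an equilibrium reactor.

```python
def converge_tear_stream(
    feed: float,
    conversion: float,
    recycle_fraction: float,
    tol: float = 1e-9,
    max_iter: int = 200,
):
    """Converge a scalar recycle flow by successive substitution.

    Returns the converged recycle flow and the number of iterations used.
    """
    recycle = 0.0  # initial guess for the torn stream
    for iteration in range(1, max_iter + 1):
        mixed = feed + recycle                      # mixer: feed + recycle
        unreacted = mixed * (1.0 - conversion)      # reactor: fixed conversion
        new_recycle = unreacted * recycle_fraction  # separator: part is recycled
        if abs(new_recycle - recycle) < tol:
            return new_recycle, iteration
        recycle = new_recycle
    raise RuntimeError("tear stream did not converge")


if __name__ == "__main__":
    flow, n_iter = converge_tear_stream(feed=100.0, conversion=0.5, recycle_fraction=0.9)
    print(f"recycle flow {flow:.4f} after {n_iter} iterations")
```

For this linear toy loop the fixed point is known analytically, `feed * a / (1 - a)` with `a = (1 - conversion) * recycle_fraction`, which gives a convenient check on the iteration.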
Economic evaluation uses net present value, factoring investment and operating costs. Investment costs scale with mass flowrate, with base values adapted from literature for reactors and distillation columns. Operating costs include steam consumption for distillation, estimated from the energy required to evaporate distillate components. Product streams are valued according to purity, with discounts applied to impure outputs.
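For orientation, the standard net present value formula can be written as a short function. This is the generic definition only: the discount rate, the project lifetime, and the constant annual cash flow are assumptions made for the sketch, since the text does not state the cost correlations or the financial parameters used.

```python
def net_present_value(
    investment: float,
    annual_revenue: float,
    annual_operating_cost: float,
    discount_rate: float,
    years: int,
) -> float:
    """Generic NPV: discounted annual cash flows minus the upfront investment.

    All arguments are placeholders; the source does not give these values.
    """
    annual_cash_flow = annual_revenue - annual_operating_cost
    discounted = sum(
        annual_cash_flow / (1.0 + discount_rate) ** year
        for year in range(1, years + 1)
    )
    return discounted - investment
```

In the game setting described earlier, a value of this kind would be computed for each player's finished flowsheet, and the higher value wins.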
The agent was trained on randomly sampled feed stream compositions, then evaluated against three benchmark flowsheets devised by the authors based on literature designs. Across 1,000 test cases, the agent matched or exceeded the best benchmark in roughly 97% of cases, with an average net present value improvement of 23%. In some instances, the agent identified profitable flowsheets that did not produce ETBE at all, instead separating and selling feed components directly, solutions overlooked by human designers.
Examples illustrate the agent’s adaptability: with excess isobutene and n-butane, it separated and sold them as pure products; with limited isobutene, it employed recycling to boost ETBE yield; and in cases of scarce reactants, it avoided costly distillation entirely, opting for simpler separations. These strategies often delivered significant economic gains over benchmarks.
Analysis of the agent’s decision patterns across feed compositions revealed structured, repeatable flowsheet choices. The exception was equimolar feeds, which the agent rarely encountered during training and for which it sometimes placed unnecessary units. Despite this, performance remained strong across the composition space.
The hierarchical decision structure markedly improved scalability, enabling the agent to navigate vast combinatorial possibilities without exhaustive search. By reducing the number of network parameters through CNN-based state encoding and retaining the competitive game format with tree search, the method achieved efficient exploration and creative problem-solving. Future work could extend the framework to handle continuous design parameters, further broadening its applicability to complex process synthesis challenges.
