Does architecture have a data problem?

Elissa Ross
MESH Consultants
Oct 22, 2019

Like everyone everywhere these days, architects want to use machine learning. They want machine learning to design their buildings, to streamline their documentation burden, to provide real-time simulation results, and to close the feedback loop between design and engineering. Unfortunately, architects have a data problem.

Let’s back up. Isn’t it true that architects feel that they are drowning in data? After all, the practice of BIM (Building Information Modeling) layers 3D geometry together with metadata representing everything from material choices to environmental analyses to spatial relationships. It could be argued that contemporary architecture projects are as much about data management as they are about designing anything. And yet, beyond the project level, when it comes to a large-scale effort to compile architectural data, nothing much has happened.

In fields where machine learning has seen the greatest successes — image recognition, natural language processing, etc. — there are abundant and excellent benchmark datasets available. Think of MNIST. And in fields where machine learning is showing emerging promise, like 3D shape recognition and processing, we are already seeing larger and larger datasets for testing. Benchmark datasets facilitate progress, with researchers able to test their algorithms systematically against existing standards (even Kaggle can be seen in this light, as a way to encourage data scientists to push their models further).

3D Datasets

Consider the efforts in the computer graphics community to address the data problem. It is only in the last few years that machine learning on 3D data has started to seem tractable, with new methods, more computing power, and of course, better data. Now both TensorFlow and PyTorch have geometry/graphics modules, which contain implementations of the latest hot-off-the-arXiv algorithms for graph and manifold learning. So what about data?

Alongside this, there have been several efforts to collect data that can be used for advancing the field of machine learning on geometric structures. Here the efforts bifurcate into two categories: graph-based datasets and geometry- or mesh-based datasets. The graph-based category is already well-covered, because in fact so much data is naturally encoded in graphs (social networks, citation databases, molecular structures, etc.). The geometry category has been slower to evolve, although there are more and more repositories of open-source geometry online (primarily in the domain of 3D printing). Here is a summary table of different 3D geometry repositories:

The ABC Dataset in particular is a large and open database of CAD models combed from the publicly available collections of Onshape, a cloud-based tool primarily used for product design and development. Similarly, Thingi10K is a smaller dataset collected from Thingiverse, an online repository of 3D-printable models.
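To make the two categories concrete, here is a minimal sketch (plain Python, with invented data, not drawn from any of the datasets named above) of how the same tetrahedron might be stored graph-style, as a list of edges, versus mesh-style, as vertex coordinates plus faces:

```python
# A tetrahedron stored two ways: as an abstract graph and as a 3D mesh.
# (Illustrative data only -- not taken from any real dataset.)

# Graph-based: connectivity only, no coordinates.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

# Geometry/mesh-based: vertex positions plus faces that index into them.
vertices = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (0.0, 0.0, 1.0),
]
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]

# The mesh implicitly contains the graph: its edges are the face boundaries.
mesh_edges = {tuple(sorted((f[i], f[(i + 1) % 3]))) for f in faces for i in range(3)}
assert mesh_edges == set(edges)
```

The mesh carries strictly more information (the embedding in space), which is one reason mesh datasets have been slower to assemble and standardize than graph datasets.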

Architectural Datasets

While the datasets for 3D geometry have come a long way, we can ask whether these datasets would be appropriate for architectural exploration. Even on geometric tasks, the types of meshes that are produced by architects tend to differ from those produced for the purpose of 3D printing or product development. In an architectural setting, mesh features such as faces may represent built elements like windows, doors or facade panels. Coincident points, lines or planes are common because we can think of built structures as inherently non-manifold (as a simple example, consider a building represented as a rectangular prism sliced into “floors”). So this leads me to ask: what would a good geometric dataset for architectural learning look like?
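The “prism sliced into floors” example can be made concrete. In a manifold mesh every interior edge is shared by exactly two faces; the sketch below (plain Python, hypothetical geometry) builds a 1×1×2 box with a floor slab at mid-height and finds the edges where three faces meet:

```python
from collections import Counter

# A 1x1x2 "building" with a floor slab at z = 1: three horizontal quads
# (ground, slab, roof) plus four walls per storey. Vertices 0-3 sit at
# z = 0, vertices 4-7 at z = 1, vertices 8-11 at z = 2.
quads = [
    (0, 1, 2, 3),       # ground
    (4, 5, 6, 7),       # intermediate floor slab
    (8, 9, 10, 11),     # roof
    (0, 1, 5, 4), (1, 2, 6, 5), (2, 3, 7, 6), (3, 0, 4, 7),      # lower walls
    (4, 5, 9, 8), (5, 6, 10, 9), (6, 7, 11, 10), (7, 4, 8, 11),  # upper walls
]

# Count how many faces use each edge; a manifold edge is used by at most two.
edge_use = Counter(
    tuple(sorted((q[i], q[(i + 1) % 4])))
    for q in quads
    for i in range(4)
)
non_manifold = {e: n for e, n in edge_use.items() if n > 2}

# Each of the four slab edges belongs to three faces: the slab itself,
# a lower wall, and an upper wall -- the non-manifold condition that
# product-design meshes rarely exhibit but architectural models do routinely.
print(non_manifold)
```

Many mesh-processing algorithms (and many of the datasets above) assume manifoldness, which is exactly why architectural geometry needs its own benchmarks.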

Certainly there have been numerous examples of floor plan generation models using machine learning. These have been interesting explorations, and have provided a good testing ground for the question of whether machine learning may be interesting to architects (answer: yes!). But if machine learning is going to be of true utility to architects, it will need to move beyond floor plans or furniture layouts, and we will need to move beyond these basic datasets.

One type of dataset we could consider is full BIM models. In this case all of the geometric considerations still apply, together with the layers of rich metadata encoded in the model. The next step would be a dataset of as-built conditions, likely in the form of scan data. And finally, we could collect building performance metrics such as energy use, for the purpose of comparing them against earlier simulation results. With the right analysis, this type of dataset could be incredibly powerful for providing information at the design stage about building performance and outcomes.
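As a sketch of what a single record in such a dataset might contain (the field names here are my own invention, not an existing schema), pairing the design-stage model with as-built scans and measured performance:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class BuildingRecord:
    """One hypothetical entry in a design / as-built / performance dataset."""
    project_id: str
    bim_model_path: str                            # design-stage model, e.g. an IFC file
    scan_path: Optional[str] = None                # as-built point-cloud scan, if captured
    simulated_kwh_per_m2: Optional[float] = None   # design-stage energy simulation
    measured_kwh_per_m2: Optional[float] = None    # metered post-occupancy energy use
    metadata: dict = field(default_factory=dict)   # materials, program, location, ...

    def simulation_gap(self) -> Optional[float]:
        """Measured minus simulated energy use: the 'performance gap'."""
        if self.simulated_kwh_per_m2 is None or self.measured_kwh_per_m2 is None:
            return None
        return self.measured_kwh_per_m2 - self.simulated_kwh_per_m2

record = BuildingRecord("demo-001", "models/demo-001.ifc",
                        simulated_kwh_per_m2=95.0, measured_kwh_per_m2=120.0)
print(record.simulation_gap())  # 25.0
```

The point of pairing the fields in one record is that the gap between simulation and measurement only becomes a learnable quantity once both sides live in the same dataset.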

Unfortunately, the existing architectural datasets remain small. The DURAARK (Durable Architectural Knowledge) project was one effort to formalize a methodology for producing a rich dataset consisting of BIM models with as-built information layered on. The emphasis there was on the durability of architectural data, the aim being long-term preservation of data from the design phase, through construction, use, refurbishment and beyond. While the potential utility of such an effort is clear, the actual dataset produced by this project consists of only 49 IFC models. What would it take to gather more data, to truly create a useful dataset for machine learning, for analysis, and other applications?

At MESH we were recently in conversation with a large architectural practice that was looking to use machine learning to speed up feasibility studies. They have been doing these studies for decades, and have collected a substantial dataset. They were interested in whether ML could shed light on what makes buildings they have worked on “theirs”. Although it would be an interesting project, the conclusion of our meeting was that, despite having thousands of detailed models, they may not really have enough data to extract meaningful inferences. When asked if they would consider pooling their data with that of another firm, the answer was a firm no (more like an appalled “never!!”). But until firms are willing to open up their repositories, it seems unlikely that the datasets collected by individual outfits will be sufficient.

Architecture and Climate Change: the case for yet more data

In case you aren’t already convinced that we need a consolidated effort to produce larger, richer and better datasets for architectural analyses and machine learning projects, let me add one more reason: climate change. The buildings and building-construction sectors together may account for more than 40% of total direct and indirect carbon emissions. Critically, this energy demand is growing due to increased construction in the developing world, growth in total floor area, and increased use of energy-consuming devices. It is essential that architects, and the AECO industry more broadly, address this issue head-on, in an urgent and fundamental way.

How can more data help address climate change? Quite simply, the more data we have, the more statistical relationships we are able to observe (for instance the link between design geometry and energy use), and the better decisions we can make. If we can help architects use this information and make better decisions in the early stages of planning a project, we may have some small hope of meeting the lofty goals of Architecture 2030, for example.
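As a toy illustration of the kind of statistical relationship meant here (entirely synthetic numbers, not real building data): with enough paired examples, even an ordinary least-squares fit can expose a link between a geometric feature, such as the envelope’s surface-to-volume ratio, and energy use intensity.

```python
# Synthetic (invented) data: surface-to-volume ratio of the building
# envelope vs. annual energy use intensity. A real dataset would supply
# these columns from models and metered buildings.
ratios = [0.20, 0.35, 0.50, 0.65, 0.80]
energy = [20.0 * r + 5.0 for r in ratios]  # perfectly linear, for illustration

# Ordinary least squares for one predictor: slope = cov(x, y) / var(x).
n = len(ratios)
mean_x = sum(ratios) / n
mean_y = sum(energy) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(ratios, energy))
         / sum((x - mean_x) ** 2 for x in ratios))
intercept = mean_y - slope * mean_x

print(round(slope, 6), round(intercept, 6))  # recovers 20.0 and 5.0
```

With real data the fit would of course be noisy and multivariate, but the mechanism is the same: more buildings in the dataset means tighter estimates, and tighter estimates mean earlier, better-informed design decisions.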

Let’s do this together

It hardly feels like my place to try to convince architects to “all just get along”; I’m just a mathematician, after all. But from an outside perspective, I’d love to see a way to break down some of the siloization of the architecture industry, to come together with our geometry and our models in a meaningful way to move the discipline, and the earth, toward better, more sustainable design.

This post was written as part of research for a presentation at the AEC Tech Symposium in NYC 2019. The summary of datasets for 3D geometry was inspired by a talk given by Qingnan Zhou of Adobe Research.
