Global fire-vegetation models are widely used to assess impacts of environmental change on fire regimes and the carbon cycle and to infer relationships between climate, land use and fire. However, differences in model structure and parameterizations, in both the vegetation and fire components of these models, could influence overall model performance, and to date there has been limited evaluation of how well different models represent various aspects of fire regimes. The Fire Model Intercomparison Project (FireMIP) is coordinating the evaluation of state-of-the-art global fire models, in order to improve projections of fire characteristics and fire impacts on ecosystems and human societies in the context of global environmental change. Here we perform a systematic evaluation of historical simulations made by nine FireMIP models to quantify their ability to reproduce a range of fire and vegetation benchmarks. The FireMIP models simulate a wide range in global annual total burnt area (39–536 Mha) and global annual fire carbon emission (0.91–4.75 Pg C yr−1) for modern conditions (2002–2012), but most of the range in burnt area is within observational uncertainty (345–468 Mha). Benchmarking scores indicate that seven out of nine FireMIP models are able to represent the spatial pattern in burnt area. The models also reproduce the seasonality in burnt area reasonably well but struggle to simulate fire season length and are largely unable to represent interannual variations in burnt area. However, models that represent cropland fires simulate fire seasonality in the Northern Hemisphere more accurately. The three FireMIP models that explicitly simulate individual fires are able to reproduce the spatial pattern in number of fires, but fire sizes are too small in key regions, and this results in an underestimation of burnt area. The correct representation of spatial and seasonal patterns in vegetation appears to correlate with a better representation of burnt area.
The two older fire models included in the FireMIP ensemble (LPJ–GUESS–GlobFIRM, MC2) clearly perform less well globally than the other models, but it is difficult to distinguish between the remaining ensemble members: some are better at representing certain aspects of the fire regime, yet none clearly outperforms all others across the full range of variables assessed.