An ensemble-based model of PM2.5 concentration across the contiguous United States with high spatiotemporal resolution

An ensemble-based model of PM2.5 concentration across the contiguous United States with high spatiotemporal resolution. Di, Q., Amini, H., Shi, L., Kloog, I., Silvern, R., Kelly, J., Sabath, M B., Choirat, C., Koutrakis, P., Lyapustin, A., Wang, Y., Mickley, L. J, & Schwartz, J. Environ. Int., 130:104909, September, 2019.

Various approaches have been proposed to model PM2.5 in the recent decade, with satellite-derived aerosol optical depth, land-use variables, chemical transport model predictions, and several meteorological variables as major predictor variables. Our study used an ensemble model that integrated multiple machine learning algorithms and predictor variables to estimate daily PM2.5 at a resolution of 1 km × 1 km across the contiguous United States. We used a generalized additive model that accounted for geographic difference to combine PM2.5 estimates from neural network, random forest, and gradient boosting. The three machine learning algorithms were based on multiple predictor variables, including satellite data, meteorological variables, land-use variables, elevation, chemical transport model predictions, several reanalysis datasets, and others. The model training results from 2000 to 2015 indicated good model performance with a 10-fold cross-validated R² of 0.86 for daily PM2.5 predictions. For annual PM2.5 estimates, the cross-validated R² was 0.89. Our model demonstrated good performance up to 60 μg/m³. Using trained PM2.5 model and predictor variables, we predicted daily PM2.5 from 2000 to 2015 at every 1 km × 1 km grid cell in the contiguous United States. We also used localized land-use variables within 1 km × 1 km grids to downscale PM2.5 predictions to 100 m × 100 m grid cells. To characterize uncertainty, we used meteorological variables, land-use variables, and elevation to model the monthly standard deviation of the difference between daily monitored and predicted PM2.5 for every 1 km × 1 km grid cell. This PM2.5 prediction dataset, including the downscaled and uncertainty predictions, allows epidemiologists to accurately estimate the adverse health effect of PM2.5. Compared with model performance of individual base learners, an ensemble model would achieve a better overall estimation. It is worth exploring other ensemble model formats to synthesize estimations from different models or from different groups to improve overall performance.
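The idea of combining base-learner estimates into one ensemble prediction can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract describes a generalized additive model that accounts for geographic difference, whereas this sketch substitutes a much simpler inverse-MSE weighted average with a single global set of weights. The function names (`r_squared`, `inverse_mse_weights`, `blend`) and all numbers are hypothetical, invented only for illustration, and the weights are fitted and scored on the same toy data rather than cross-validated.

```python
from statistics import mean


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    obs_mean = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def inverse_mse_weights(observed, base_predictions):
    """Weight each base learner by the inverse of its mean squared error."""
    n_learners = len(base_predictions[0])
    weights = []
    for j in range(n_learners):
        mse = mean((o - row[j]) ** 2 for o, row in zip(observed, base_predictions))
        weights.append(1.0 / mse)
    return weights


def blend(base_predictions, weights):
    """Normalized weighted average of the base-learner predictions per sample."""
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in base_predictions
    ]


# Invented toy data: "monitored" PM2.5 values and, for each sample, the
# predictions of three hypothetical base learners (stand-ins for a neural
# network, a random forest, and gradient boosting).
observed = [5.0, 8.0, 12.0, 20.0, 35.0, 10.0]
base_predictions = [
    [7.0, 3.0, 6.0],
    [6.0, 10.0, 9.0],
    [14.0, 10.0, 11.0],
    [18.0, 22.0, 19.0],
    [37.0, 33.0, 36.0],
    [8.0, 12.0, 9.0],
]

weights = inverse_mse_weights(observed, base_predictions)
ensemble = blend(base_predictions, weights)

# R² of each base learner on its own, then of the blended prediction.
base_r2 = [
    r_squared(observed, [row[j] for row in base_predictions])
    for j in range(len(weights))
]
ensemble_r2 = r_squared(observed, ensemble)

print("base R²:", base_r2)
print("ensemble R²:", ensemble_r2)
```

In this toy data the errors of the first two learners cancel, so the blend scores higher than any single learner. That is a property of the invented numbers, not a general guarantee. A closer analogue of the paper's approach would let the weights vary smoothly with location and would evaluate the blend with held-out folds.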

@ARTICLE{Di2019-rr,
  title    = "An ensemble-based model of {PM2.5} concentration across the
              contiguous United States with high spatiotemporal resolution",
  author   = "Di, Qian and Amini, Heresh and Shi, Liuhua and Kloog, Itai and
              Silvern, Rachel and Kelly, James and Sabath, M Benjamin and
              Choirat, Christine and Koutrakis, Petros and Lyapustin, Alexei
              and Wang, Yujie and Mickley, Loretta J and Schwartz, Joel",
  abstract = "Various approaches have been proposed to model PM2.5 in the
              recent decade, with satellite-derived aerosol optical depth,
              land-use variables, chemical transport model predictions, and
              several meteorological variables as major predictor variables.
              Our study used an ensemble model that integrated multiple machine
              learning algorithms and predictor variables to estimate daily
              PM2.5 at a resolution of 1 km $\times$ 1 km across the contiguous
              United States. We used a generalized additive model that
              accounted for geographic difference to combine PM2.5 estimates
              from neural network, random forest, and gradient boosting. The
              three machine learning algorithms were based on multiple
              predictor variables, including satellite data, meteorological
              variables, land-use variables, elevation, chemical transport
              model predictions, several reanalysis datasets, and others. The
              model training results from 2000 to 2015 indicated good model
              performance with a 10-fold cross-validated $R^2$ of 0.86 for
              daily PM2.5 predictions. For annual PM2.5 estimates, the
              cross-validated $R^2$ was 0.89. Our model demonstrated good
              performance up to 60 $\mu$g/m$^3$. Using trained PM2.5 model and
              predictor variables, we predicted daily PM2.5 from 2000 to 2015
              at every 1 km $\times$ 1 km grid cell in the contiguous United
              States. We also used localized land-use variables within 1 km
              $\times$ 1 km grids to downscale PM2.5 predictions to 100 m
              $\times$ 100 m grid cells. To characterize uncertainty, we used
              meteorological variables, land-use variables, and elevation to
              model the monthly standard deviation of the difference between
              daily monitored and predicted PM2.5 for every 1 km $\times$ 1 km
              grid cell. This PM2.5 prediction dataset, including the
              downscaled and uncertainty predictions, allows epidemiologists to
              accurately estimate the adverse health effect of PM2.5. Compared
              with model performance of individual base learners, an ensemble
              model would achieve a better overall estimation. It is worth
              exploring other ensemble model formats to synthesize estimations
              from different models or from different groups to improve overall
              performance.",
  journal  = "Environ. Int.",
  volume   =  130,
  pages    = "104909",
  month    =  sep,
  year     =  2019,
  keywords = "Ensemble model; Fine particulate matter (PM(2.5)); Gradient
              boosting; Neural network; Random forest",
  language = "en"
}
