20+ Health Related Data Sets In A Single Line of Code
Nov 29, 2022
TLDR: Here are 20+ interesting, easily loadable health-related datasets for you to begin exploring right away.
Introduction
Finding good health-related datasets to explore is not an easy task. It often takes a lot of searching and digging to find something that is trustworthy, insightful, and, most importantly, actionable.
In this article, I present a few favorite health-related data sources. For more on this topic see the related article 10 + Health Related Data Visuals In A Single Line of Code.
I handpicked these datasets for their ease of use, so that you can load each of them with a single line of code.
Getting data quickly and easily for training, testing, demonstration, or education purposes is handy.
Needless to say, loading data with only a single line of code brings a couple of particular limitations (none of them grave, but worth mentioning anyway). I discuss these limitations at the bottom of the article.
All in all, I hope this article will be extremely useful for you, so without further ado, let's get started:
Standard Data Science Python Imports
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import requests
1-3 HealthData.gov
If you are looking for fascinating and thought-provoking datasets on all sorts of health-related topics, you have to check out HealthData.gov.
The website is run by the US Department of Health and Human Services, and its goal is to make health data publicly available and easily accessible to citizens and decision-makers in the business world.
Most of their datasets come with dozens of columns and variables to be explored about a particular topic, perfect for deeply exploring topics and seeing where your curiosity takes you.
Below, let's load a few of their datasets in a single line of code.
Maltreatment & Abuse Data
pd.read_csv('https://healthdata.gov/api/views/'\
+'8bce-qw8w/rows.csv?accessType=DOWNLOAD')
For the result (excerpted):
Child Abuse Data
pd.read_csv('https://healthdata.gov/api/'\
+'views/xn3e-yyaj/rows.csv?accessType=DOWNLOAD')
For the result (excerpted):
Child Abuse Relationship Data
pd.read_csv('https://healthdata.gov/api/views/'\
+'tw7x-jbvq/rows.csv?accessType=DOWNLOAD')
For the result (excerpted):
4-9 Humanitarian Data Exchange
The Humanitarian Data Exchange is an open-source data-sharing platform created by the United Nations Office for the Coordination of Humanitarian Affairs - or OCHA for short - in order to spread awareness about key humanitarian initiatives.
As of this writing, they have more than 20,000 publicly available datasets, and this number is likely to keep growing as more people upload their contributions.
Natural Disasters Data
pd.read_csv('https://data.humdata.org/dataset/'\
+'73fcf87e-c8d7-4310-a3ed-8d201ae12246/resource/'\
+'e42f60f5-1d3c-4bc1-9a95-ac923adb78ba/download/'\
+'deaths-natural-disasters.csv',delimiter=';')
For the result (excerpted) (note the abundant NaN values):
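Since missing values are common in this file, it helps to count them per column before any analysis. Here is a small sketch on a made-up, semicolon-delimited sample. The column names and numbers below are illustrative only and are not taken from the real file:

```python
import io
import pandas as pd

# Made-up sample in the same semicolon-delimited layout; empty fields become NaN.
sample = io.StringIO(
    'Entity;Year;Drought;Flood\n'
    'World;1900;1261000;\n'
    'World;1901;;\n'
    'World;1902;;6\n'
)
df = pd.read_csv(sample, delimiter=';')

# Count missing values per column, then the share of rows that are missing.
nan_counts = df.isna().sum()
nan_share = df.isna().mean()
print(nan_counts)
```

The same two lines (isna().sum() and isna().mean()) should work unchanged on the data frame returned by the one-liner above.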
Volcano Population Exposure
The Population Exposure Index - or PEI for short - is a widely used index for estimating the potential risk of a given volcano.
In (very) simplistic terms, this index is calculated by weighing the number of people who live near a particular volcano against the historical radius of damage of its past eruptions.
pd.read_csv('https://data.humdata.org/dataset/'\
+'a60ac839-920d-435a-bf7d-25855602699d/'\
+'resource/e3b1ecf0-ec47-49f7-9011-'\
+'6bbb7403ef6d/'\
+'download/volcano.csv')
For the result (excerpted):
Healthcare In Ukraine
Attacks on Ukrainian Healthcare Facilities since the start of the 2021/2022 war.
pd.read_csv('https://docs.google.com/'\
+'spreadsheets/d/e/'\
+'2PACX-1vTDw_w3n9b0_'\
+'frBtvJWtZTJGb5Bn72sZs'\
+'jJSRXLhIxMa6I1ZECFjb1LTsT'\
+'Z0PmIHiQOw4SEPCO4uIFv/'\
+'pub?gid=1932484940&single'\
+'=true&output=csv')
For the result (excerpted):
Aid effectiveness in Ukraine.
pd.read_csv('https://data.humdata.org/'\
+'dataset/018a77c2-b8f6-494a-'\
+'8376-c044b8408b20'\
+'/resource/6fe80193-1f80-442a-'\
+'bc70-6b337066cc27/download/'\
+'aid-effectiveness_ukr.csv').drop(0)
For the result (excerpted):
Ukrainian Health Indicators by year.
pd.read_csv('https://data.humdata.org/'\
+'dataset/99259f69-f203-4d23-'\
+'83cc-503de6c5ecae/'\
+'resource/ec5fb208-1530-4551-'\
+'a8ba-03ad7e1d9bfc/download/'\
+'health_ukr.csv').drop(0)
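The .drop(0) at the end of the last two calls removes the first data row. On the Humanitarian Data Exchange that row typically holds HXL hashtags (such as #country+name) rather than data. That is my assumption about why the row is dropped here, so check the file you load. A sketch on a made-up sample, where the column names and values are illustrative only:

```python
import io
import pandas as pd

# Made-up sample: a header row, a row of HXL hashtags, then the data rows.
sample = io.StringIO(
    'Country Name,Year,Value\n'
    '#country+name,#date+year,#indicator+value+num\n'
    'Ukraine,2019,71.8\n'
    'Ukraine,2020,71.2\n'
)
df = pd.read_csv(sample).drop(0)

# The hashtag row forces every column to be read as text,
# so numeric columns need an explicit conversion afterwards.
df['Year'] = pd.to_numeric(df['Year'])
df['Value'] = pd.to_numeric(df['Value'])
print(df.dtypes)
```

Note that the index still starts at 1 after the drop; calling .reset_index(drop=True) renumbers it from 0 if that matters for your analysis.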
Threats to Aid Workers.
Aid workers killed, injured, kidnapped and arrested in 2022.
pd.read_excel('https://data.humdata.org/'\
+'dataset/c3c3d829-a51f-4844-'\
+'b405-ef02f7746fb8/resource/'\
+'3129bb20-79ed-437b-9346-cfa8e2609a8f/'\
+'download/2022-aid-worker-kika-incident-data.xlsx')
For the result (excerpted):
10-11 Seaborn Datasets
According to its official documentation, this is the dataset collection used by Seaborn, one of the most widely used Python data visualization libraries.
Keep in mind that the Seaborn dataset collection is large and covers other topics of interest as well. Here we focus on health data.
Below, I will provide you with two ways of loading the data: with Seaborn's built-in load_dataset function, and by reading directly from their GitHub repository.
Attention & Test Scores
Option one from the GitHub repo. Test score by attention type.
pd.read_csv('https://raw.githubusercontent.com/'\
+'mwaskom/seaborn-data/master/attention.csv')
Option two, from the sns.load_dataset() method.
df = sns.load_dataset('attention')
Both give the same result (excerpted):
Brain Networks
df = sns.load_dataset('brain_networks')
Or...
pd.read_csv('https://github.com/mwaskom/'\
+'seaborn-data/raw/master/brain_networks.csv')
Both give the same result (excerpted):
12-13 Data USA
Data USA is a platform for distributing, downloading, and creating in-browser visualizations of US government data. It was created in partnership with Deloitte - one of the world's top consulting firms - and the Massachusetts Institute of Technology to give everyday people access to trustworthy data.
Much of this data is not readily available in csv or xlsx format for easy download. Instead, you first call requests.get to fetch the JSON response from their API, and then pass its contents to pd.DataFrame.
So with that in mind, let's load some of their datasets with only a single line of code.
Health Professionals Data
pd.DataFrame(\
requests.get(\
'https://backend-api.datausa.io/'\
+'api/data?CIP2=51&measure=Total%'\
+'20Population,Total%20Population%'\
+'20MOE%20Appx,Average%20Wage,'\
+'Average%20Wage%20Appx%20MOE,yocpop%'\
+'20RCA,Record%20Count&Workforce%'\
+'20Status=true&drilldowns'\
+'=Detailed%20Occupation&order='\
+'Total%20Population&sort=desc&'\
+'Record%20Count>=5').json()['data'])
For the result (excerpted):
From this same source and service we can also grab information about foreign healthcare professionals working in the US.
pd.DataFrame(\
requests.get(\
'https://backend-api.datausa.io/api/'\
+'data?CIP2=51&drilldowns=Nativity,'\
+'Birthplace&measures=Total Population,ycbpop RCA,'\
+'Record Count&Nativity=2&Workforce Status='\
+'true&properties=Country Code&Record Count>=5'
).json()['data'])
For the result (excerpted):
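The long query strings above can also be assembled from a dictionary, which is easier to read and edit. Here is a minimal sketch using the standard library plus pandas. The helper name build_datausa_url is hypothetical, the Record Count>=5 filter from the original URL is left out for simplicity, and the payload is a made-up stand-in for what .json() returns. I am assuming only that the rows sit under a 'data' key, as the one-liners above do:

```python
from urllib.parse import quote, urlencode

import pandas as pd

BASE_URL = 'https://backend-api.datausa.io/api/data'

def build_datausa_url(params):
    """Percent-encode params (spaces become %20, commas are kept)."""
    return BASE_URL + '?' + urlencode(params, quote_via=quote, safe=',')

# Parameters copied from the foreign-born professionals example above.
params = {
    'CIP2': '51',
    'drilldowns': 'Nativity,Birthplace',
    'measures': 'Total Population,ycbpop RCA,Record Count',
    'Nativity': '2',
    'Workforce Status': 'true',
    'properties': 'Country Code',
}
url = build_datausa_url(params)

# Made-up payload with illustrative values; real responses carry more fields.
payload = {'data': [
    {'Birthplace': 'Country A', 'Total Population': 1200},
    {'Birthplace': 'Country B', 'Total Population': 800},
]}
df = pd.DataFrame(payload['data'])
print(url)
print(df)
```

With network access, requests.get(url).json() would take the place of the hand-written payload.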
14-22 CORGIS Data Project
The CORGIS project (short for the Collection of Really Great, Interesting, Situated Datasets) aggregates cool and interesting datasets from all over the internet and makes them easily downloadable.
More than a handful of brilliant minds from Virginia Tech are behind this project, and there are dozens of datasets in a myriad of topics available for download.
So if you are looking for a dataset to explore and work on your next project, please check them out.
Below, I will give you the line of code necessary to import the datasets from the CORGIS GitHub page and also point you to the original source.
COVID Data
Evolution of coronavirus cases and deaths.
# Originally sourced from
# https://data.europa.eu/
# euodp/en/data/dataset/
# covid-19-coronavirus-data
pd.read_csv('https://corgis-edu.github.io/'\
+'corgis/datasets/csv/covid/covid.csv')
For the result (excerpted):
COVID related behavior tracking survey.
# Sourced from
# https://github.com/
# YouGov-Data/covid-19-tracker
pd.read_csv('https://corgis-edu.github.io/'\
+'corgis/datasets/csv/'\
+'covid_behaviors/covid_behaviors.csv')
For the result (excerpted):
Mobility trends during COVID.
# Sourced from
# https://github.com/
# owid/owid-datasets/
# tree/master/datasets/
# Google%20Mobility%20Trends%20(2020)
pd.read_csv('https://corgis-edu.github.io/'\
+'corgis/datasets/csv/'\
+'covid_mobility/covid_mobility.csv')
Non-Covid Diseases
Non-covid diseases (world-wide).
# Sourced from
# https://aidsinfo.unaids.org/
pd.read_csv('https://corgis-edu.github.io/'\
+'corgis/datasets/csv/aids/aids.csv')
For the result (excerpted):
Non-covid diseases (United States).
pd.read_csv('https://corgis-edu.github.io/'\
+'corgis/datasets/csv/health/health.csv')
For the result (excerpted):
United States cancer data and statistics.
# Originally sourced from
# https://www.socialexplorer.com/explore-tables
pd.read_csv('https://corgis-edu.github.io/'\
+'corgis/datasets/csv/cancer/cancer.csv')
Drug Usage
General drug usage rates across the United States.
# Originally sourced from
# https://pdas.samhsa.gov/#/
pd.read_csv('https://corgis-edu.github.io/'\
+'corgis/datasets/csv/drugs/drugs.csv')
For the result (excerpted):
Opioid use data across the United States.
# Originally sourced from
# https://nida.nih.gov/
# research-topics/
# trends-statistics/overdose-death-rates
pd.read_csv('https://corgis-edu.github.io/'\
+'corgis/datasets/csv/opioids/opioids.csv')
For the result (excerpted):
Work Related Injuries
# Originally sourced from
# https://www.osha.gov/
# ords/odi/establishment_search.html
pd.read_csv('https://corgis-edu.github.io/'\
+'corgis/datasets/csv/injuries/'\
+'injuries.csv')
For the result (excerpted):
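Every CORGIS link in this section follows the same pattern, with the dataset name appearing twice in the path. Here is a short sketch that builds the links for the datasets listed above; the helper name corgis_csv_url is hypothetical:

```python
CORGIS_BASE = 'https://corgis-edu.github.io/corgis/datasets/csv/'

def corgis_csv_url(name):
    """Build the CSV URL for a CORGIS dataset, e.g. .../csv/covid/covid.csv."""
    return f'{CORGIS_BASE}{name}/{name}.csv'

# Dataset names used in this section.
names = ['covid', 'covid_behaviors', 'covid_mobility', 'aids', 'health',
         'cancer', 'drugs', 'opioids', 'injuries']
corgis_urls = {name: corgis_csv_url(name) for name in names}
print(corgis_urls['opioids'])
```

Passing any value of corgis_urls to pd.read_csv() loads the same file as the matching one-liner.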
Limitations
This article has at least a few limitations and caveats. Here are some of them.
- Style - The code examples in this article break many coding style conventions. Do not use this article as a model for style.
- Line Continuation - For many examples this article uses line continuation. In Python the backslash \ assists with line continuation. Line wrap (or line continuation), for purposes of this article, does not count as a second line of code.
- String Concatenation - In order to make the code look good here on this platform and to avoid odd line-wrapping, the article also uses string concatenation in multiple places.
- Indexing Lists + Tables - As the documentation for pd.read_html() explains, that function returns a list of pandas data frames. An index in square brackets, as in pd.read_html()[i], where i is a position in the list, selects the data frame of interest.
- This article also uses requests.get().json() or requests.get().json()['data'] to fetch JSON data from the web. In a few examples here, the square bracket notation isolates the data of interest.
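The line continuation and string concatenation points above can be checked without any download. The sketch below builds one of the article's URLs three ways and confirms they are identical. Inside parentheses Python continues the line on its own, so the backslashes (and even the + signs between adjacent string literals) are optional:

```python
# Style used in this article: backslash continuation plus explicit +.
url_a = 'https://corgis-edu.github.io/'\
    +'corgis/datasets/csv/covid/covid.csv'

# Inside parentheses the backslash is unnecessary.
url_b = ('https://corgis-edu.github.io/'
         + 'corgis/datasets/csv/covid/covid.csv')

# Adjacent string literals are joined automatically, so + can go too.
url_c = ('https://corgis-edu.github.io/'
         'corgis/datasets/csv/covid/covid.csv')

print(url_a == url_b == url_c)  # True
```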
Thanks for reading!