How it works

There is no complete dataset of workforce development training providers. To get around this, we combined major publicly available data sources into a comprehensive new dataset, referred to in this methodology as the ‘novel dataset’. Our priority is for this dataset to be a public good that others in the sector can use, so we focused on publicly available data sources.

Creating the novel dataset

We used four main public data sources to create our novel dataset. These were chosen because they provided information that matched our working definition of workforce development training providers. The types of providers we looked for include community colleges, technical colleges, non-profit training providers, private training providers, and federally registered apprenticeships, including those provided by industry associations, unions and, sometimes, state and local governments.


Here's a little more detail about each of the data sources we used:


1. Integrated Postsecondary Education Data System (IPEDS)

Data used: Survey Year 2020
Extract Date: 2022-07-27

This dataset has information about higher education institutions in the US. Any institution that has applied for or received federal student support funds is required to submit data to IPEDS. Even though the data is self-reported and somewhat limited, it is still the most complete dataset on American post-secondary education institutions. We used a subset of IPEDS that consists mostly of institutions whose main offerings are degrees or certificates below the bachelor's level, such as community and technical colleges.



2. Registered Apprenticeship Partners Information System (RAPIDS)

Data used: Fiscal Year 2020
Extract Date: 2022-07-27

This is a Department of Labor list of Registered Apprenticeships. These apprenticeships provide career pathways in high-demand fields. They are approved by the Department of Labor or State Apprenticeship Agencies and must be industry-vetted and paid. They can be offered by a variety of providers, including employers, unions, colleges, and occasionally, state and local governments. Our dataset includes all providers listed in RAPIDS.



3. Internal Revenue Service (IRS)

Data used: Exempt Organizations Business Master File (EO BMF) updated data posting year 2022 and IRS 990 Series via Google BigQuery
Extract Date: 2022-07-27

The IRS publishes a list of tax-exempt organizations that have to complete and submit Form 990 each year. We included organizations related to workforce training and employment in our dataset, such as vocational and job training non-profits and technical schools.



4. TrainingProviderResults.gov (TPR)

Data used: Version 3.0, updated 7/6/2022; state-submitted reports for Program Year 2021
Extract Date: 2022-07-27

This new Department of Labor dataset lists eligible training providers where workers can use federal vouchers to enroll. This includes higher education institutions, national apprenticeships, private non-profits, and public training providers. TPR also includes private for-profit training providers, which are typically not found in other public datasets. We include all TPR providers in our dataset, except for those that mainly offer bachelor's degrees.


Merging and cleaning data

We began by combining the four different data sources.

The second step was deduplication (also known as record linkage): we identified duplicate records of the same organization, kept one of them, and discarded the rest. This was a lengthy, complex process with multiple iterations, but it was systematic (SQL-based) and fully reproducible.

Deduplication started with standardizing each provider's address using the US Census Bureau’s Geocoder batch API. We then used a combination of Jaro-Winkler and Levenshtein distance algorithms to find providers with similar names at the same address. If the names were 60% or more similar, we assumed they were duplicates and assigned them the same unique identifier (dedupe_id in the dataset). We manually and thoroughly reviewed the results of this process but were not satisfied with its performance, as some duplicates remained in the data. We therefore repeated the process, relaxing the same-address requirement to the same state and city and raising the name similarity threshold to 90%. This eliminated further duplicate records.
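The two-pass name matching described above can be sketched in Python as follows. This is only an illustration, not the SQL used to build the dataset: the text does not say how the Jaro-Winkler and Levenshtein scores were combined, so averaging them is an assumption, and the record fields (name, address, city, state) and function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Scale edit distance to a 0..1 similarity."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaro(a, b):
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a, b, scaling=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(a, b)
    prefix = 0
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


def name_similarity(a, b):
    # Assumption: the two scores are averaged; the source does not say how they were combined.
    a, b = a.strip().lower(), b.strip().lower()
    return (jaro_winkler(a, b) + levenshtein_similarity(a, b)) / 2


def is_duplicate(rec_a, rec_b):
    """Pass 1: same address, names >= 60% similar. Pass 2: same city and state, names >= 90% similar."""
    similarity = name_similarity(rec_a["name"], rec_b["name"])
    if rec_a["address"] == rec_b["address"]:
        return similarity >= 0.60
    if (rec_a["city"], rec_a["state"]) == (rec_b["city"], rec_b["state"]):
        return similarity >= 0.90
    return False
```

Records flagged as duplicates would then share one dedupe_id. Comparing every pair is quadratic, so in practice candidates would first be grouped by address or by city and state.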

Some unique name-address combinations appeared in more than one data source. To ensure we had one unique unit of analysis per row, we prioritized name-address combinations from certain data sources and removed their duplicates from the other data sources, as shown in Table 1.



Table 1 - Deduplication based on data source prioritization

At each step, ask: is this organization found in the following source and category (even if also found in others)? If yes, keep the entry as found in that source and category and remove duplicates in other sources. If no, move to the next step.

Step   Source and category
1      IPEDS
2      IRS – B41: Community or junior colleges
3      TPR – One of the three 'higher ed' types
4      RAPIDS – Community college / University
5      IRS – All other categories
6      TPR – Private non-profit
7      RAPIDS – Community-based organization
8      RAPIDS – All other categories
9      TPR – National apprenticeship
10     TPR – Private for-profit
11     TPR – Public
12     TPR – Other or Null
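The prioritization in Table 1 amounts to ranking each record by its source and category and keeping the best-ranked record per dedupe_id. A minimal Python sketch follows, assuming each record is a dictionary with hypothetical keys dedupe_id, source and category. The category strings are illustrative stand-ins for the labels in the raw sources, which are not reproduced here.

```python
from typing import Optional


def priority_rank(source: str, category: Optional[str]) -> int:
    """Return the Table 1 step (1 = highest priority) for a source/category pair."""
    category = (category or "").strip().lower()
    if source == "IPEDS":
        return 1
    if source == "IRS":
        # Assumed label text for NTEE code B41.
        return 2 if "community or junior colleges" in category else 5
    if source == "TPR":
        if category.startswith("higher ed"):
            return 3
        return {
            "private non-profit": 6,
            "national apprenticeship": 9,
            "private for-profit": 10,
            "public": 11,
        }.get(category, 12)  # "Other or Null" falls through to step 12
    if source == "RAPIDS":
        return {
            "community college / university": 4,
            "community-based organization": 7,
        }.get(category, 8)
    raise ValueError(f"unknown source: {source}")


def keep_preferred(records):
    """Keep one record per dedupe_id: the one found at the earliest Table 1 step."""
    best = {}
    for rec in records:
        rank = priority_rank(rec["source"], rec.get("category"))
        current = best.get(rec["dedupe_id"])
        if current is None or rank < current[0]:
            best[rec["dedupe_id"]] = (rank, rec)
    return [rec for _, rec in best.values()]
```

For example, a provider listed both in IPEDS and as a TPR private for-profit would be kept as its IPEDS entry.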




Creating a single classification system

We recorded the sources in which information about providers was found in four logical (true/false) variables: in_ipeds, in_rapids, in_irs and in_tpr. Before removing duplicates across different data sources, we captured, within these variables, when unique providers appeared in multiple sources. We also recorded the provider subtype, as recorded in the original data source, in the variables org_subtype_ipeds, org_subtype_rapids, org_subtype_irs and org_subtype_tpr. Then we used this information to bring organizations into a single, new classification system made of four types and 14 subtypes.

Note that:

  • These classifications are not mutually exclusive: organizations may be classified as more than one type, or more than one subtype.
  • Subtypes are not subsets of types, meaning organizations we classified as one subtype may come from any of the four types.
  • Some organizations are not classified into any subtype.

The four organization types

Type                               Definition
1. Higher education institution    in_ipeds == true
2. Registered apprenticeship       in_rapids == true
3. WIOA-eligible                   in_tpr == true
4. Non-profit organization         in_irs == true
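As an illustration of how the four source flags and the resulting types relate, here is a short Python sketch. It assumes records that still contain cross-source duplicates, each with hypothetical keys dedupe_id and source. The function names are illustrative and are not part of the published dataset.

```python
SOURCE_FLAGS = {
    "IPEDS": "in_ipeds",
    "RAPIDS": "in_rapids",
    "IRS": "in_irs",
    "TPR": "in_tpr",
}

TYPE_BY_FLAG = {
    "in_ipeds": "Higher education institution",
    "in_rapids": "Registered apprenticeship",
    "in_tpr": "WIOA-eligible",
    "in_irs": "Non-profit organization",
}


def source_flags(records):
    """Build {dedupe_id: {flag: bool}} before cross-source duplicates are dropped."""
    flags = {}
    for rec in records:
        row = flags.setdefault(
            rec["dedupe_id"], dict.fromkeys(SOURCE_FLAGS.values(), False)
        )
        row[SOURCE_FLAGS[rec["source"]]] = True
    return flags


def organization_types(flag_row):
    """Types are not mutually exclusive: return every type whose flag is true."""
    return [label for flag, label in TYPE_BY_FLAG.items() if flag_row[flag]]
```

A provider found in both IPEDS and TPR would therefore be both a higher education institution and WIOA-eligible.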



The 14 organization subtypes

Each subtype is followed by its definition.

  1. Highest degree certificate
     org_subtype_ipeds == "Nondegree-granting, sub-baccalaureate"

  2. Highest degree associate’s
     org_subtype_ipeds == "Degree-granting, Associate's and certificates"

  3. Highest degree bachelor’s+
     org_subtype_ipeds == "Degree-granting, not primarily baccalaureate or above"

  4. Other higher education institution
     org_subtype_ipeds == 0 & org_subtype_irs == "Community or Junior Colleges"
     org_subtype_ipeds == 0 & org_subtype_tpr == "Higher Ed*"
     org_subtype_ipeds == 0 & org_subtype_rapids == "Community College/University"

  5. Private for-profit
     org_subtype_tpr == "Private for profit"

  6. Apprenticeship sponsor / labor/union
     org_subtype_rapids == "Apprenticeship – Labor/Union"

  7. Apprenticeship – business association
     org_subtype_rapids == "Apprenticeship – Business Association"

  8. Apprenticeship sponsor / employer
     org_subtype_rapids == "Apprenticeship – Employer"

  9. Apprenticeship sponsor / intermediary
     org_subtype_rapids == "Apprenticeship – Intermediary"

  10. Apprenticeship sponsor / government
      org_subtype_rapids == "Apprenticeship – Federal Agency" | "Apprenticeship – City/County Agency" | "Apprenticeship – State Agency"

  11. Apprenticeship sponsor / workforce investment board
      org_subtype_rapids == "Apprenticeship – Workforce Investment Board"

  12. Apprenticeship sponsor / foundation
      org_subtype_rapids == "Apprenticeship – Foundation"

  13. Apprenticeship sponsor / other
      org_subtype_rapids == "Apprenticeship – Other" | "Apprenticeship – None" | "Apprenticeship – Unknown"
      org_subtype_tpr == "National Apprenticeship"

  14. Job training non-profit
      org_subtype_irs == "Vocational, Technical Schools" | "Employment Procurement Assistance, Job Training" | "Vocational Counseling, Guidance and Testing" | "Vocational Training" | "Vocational Rehabilitation" | "Goodwill Industries" | "Sheltered Remunerative Employment, Work Activity Center N.E.C."
      org_subtype_tpr == "private non-profit"
      org_subtype_rapids == "community based organization"
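The subtype definitions above could be applied roughly as in the Python sketch below. It rests on several assumptions that the definitions do not state outright: separate conditions listed for one subtype are treated as alternatives, labels are compared case-insensitively, a missing IPEDS subtype is stored as 0 (or None), and "Higher Ed*" is read as a prefix match. The function name classify_subtypes is hypothetical.

```python
IPEDS_SUBTYPES = {
    "nondegree-granting, sub-baccalaureate": "Highest degree certificate",
    "degree-granting, associate's and certificates": "Highest degree associate’s",
    "degree-granting, not primarily baccalaureate or above": "Highest degree bachelor’s+",
}

RAPIDS_SUBTYPES = {
    "apprenticeship – labor/union": "Apprenticeship sponsor / labor/union",
    "apprenticeship – business association": "Apprenticeship – business association",
    "apprenticeship – employer": "Apprenticeship sponsor / employer",
    "apprenticeship – intermediary": "Apprenticeship sponsor / intermediary",
    "apprenticeship – federal agency": "Apprenticeship sponsor / government",
    "apprenticeship – city/county agency": "Apprenticeship sponsor / government",
    "apprenticeship – state agency": "Apprenticeship sponsor / government",
    "apprenticeship – workforce investment board": "Apprenticeship sponsor / workforce investment board",
    "apprenticeship – foundation": "Apprenticeship sponsor / foundation",
    "apprenticeship – other": "Apprenticeship sponsor / other",
    "apprenticeship – none": "Apprenticeship sponsor / other",
    "apprenticeship – unknown": "Apprenticeship sponsor / other",
}

JOB_TRAINING_IRS = {
    "vocational, technical schools",
    "employment procurement assistance, job training",
    "vocational counseling, guidance and testing",
    "vocational training",
    "vocational rehabilitation",
    "goodwill industries",
    "sheltered remunerative employment, work activity center n.e.c.",
}


def norm(value):
    """Lower-case and trim a raw subtype label; treat 0 or None as missing."""
    return "" if value in (0, None) else str(value).strip().lower()


def classify_subtypes(org):
    """Return every subtype an organization falls into (possibly none)."""
    ipeds = norm(org.get("org_subtype_ipeds"))
    irs = norm(org.get("org_subtype_irs"))
    tpr = norm(org.get("org_subtype_tpr"))
    rapids = norm(org.get("org_subtype_rapids"))

    subtypes = []
    if ipeds in IPEDS_SUBTYPES:
        subtypes.append(IPEDS_SUBTYPES[ipeds])
    if not ipeds and (irs == "community or junior colleges"
                      or tpr.startswith("higher ed")
                      or rapids == "community college/university"):
        subtypes.append("Other higher education institution")
    if tpr == "private for profit":
        subtypes.append("Private for-profit")
    if rapids in RAPIDS_SUBTYPES:
        subtypes.append(RAPIDS_SUBTYPES[rapids])
    if tpr == "national apprenticeship" and "Apprenticeship sponsor / other" not in subtypes:
        subtypes.append("Apprenticeship sponsor / other")
    if (irs in JOB_TRAINING_IRS or tpr == "private non-profit"
            or rapids == "community based organization"):
        subtypes.append("Job training non-profit")
    return subtypes
```

Because the function returns a list, an organization can carry several subtypes or none, which matches the notes on the classification system.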



Testing the novel dataset

Once the data was combined and cleaned, we tested its quality in two ways.

First, we checked whether we had removed any non-duplicate providers. Our 90% similarity cutoff eliminated some unique providers with very similar names and addresses. There is no systematic way to correct this, so our new dataset omits a very small percentage (less than 0.5%) of unique training providers. We also checked whether we had failed to eliminate any duplicate training providers at this stage.

Second, we checked whether our dataset was relevant. Even though we had tried to include only providers that matched our definition of workforce development training organizations, we knew some might not align. So, we randomly selected 2.5% of the dataset and checked those providers. This helped us understand which parts of our definition were well represented and which parts were missing or underrepresented, and whether we needed to clean the data in any systematic way to fix this. After this check, we removed some providers that operated outside the US or served high school students. We did this by searching for specific keywords in provider names, including “high school,” “adult school,” and “adult education”. Finally, we removed providers that appear to finance the sector or help workers search for jobs rather than train for employment. These include foundations, funds, or trusts, since these providers tend to fund training provision rather than offer training themselves, as well as providers identified as career centers.
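The relevance check can be illustrated with a small Python sketch: drawing a reproducible 2.5% random sample for review, and flagging provider names that contain the three keywords quoted above. The function names and the fixed seed are assumptions made for the example. The filters for foundations, funds, trusts and career centers are not sketched because their exact keywords are not listed here.

```python
import random

# Only the keywords quoted in the text; the full list used may be longer.
EXCLUDE_KEYWORDS = ("high school", "adult school", "adult education")


def review_sample(records, fraction=0.025, seed=0):
    """Draw a reproducible random sample (2.5% by default) for manual review."""
    if not records:
        return []
    size = max(1, round(len(records) * fraction))
    return random.Random(seed).sample(records, size)


def is_out_of_scope(provider_name):
    """Flag names containing any excluded keyword, ignoring case."""
    lowered = provider_name.lower()
    return any(keyword in lowered for keyword in EXCLUDE_KEYWORDS)


def drop_out_of_scope(records):
    """Keep records whose (hypothetical) 'name' field passes the keyword filter."""
    return [rec for rec in records if not is_out_of_scope(rec["name"])]
```

Substring matching of this kind is deliberately blunt, so the flagged names would still merit a quick manual look before removal.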



Notes on data coverage

Our new dataset contains 16,781 training providers from all over the US. It includes higher education institutions, WIOA-eligible organizations, non-profit training providers, and registered apprenticeship providers.

Our dataset should be comprehensive for publicly funded providers, but it might miss some private, for-profit institutions due to lack of available data on these providers.

Higher education institutions are well represented because of mandatory data reporting to IPEDS for institutions that receive Title IV funding. IPEDS also includes some institutions that do not receive Title IV funding, though this reporting is not mandatory.

Private non-profits are also largely covered in our dataset. We made sure to include all relevant IRS Core Codes and subcodes, consulting experts in the field for guidance on which to select. It is possible, however, that we missed some non-profits related to workforce development or included some that don't provide training. Generally, we leaned on the side of inclusion over exclusion in this group.

Registered Apprenticeships from the Department of Labor are all included in our dataset. However, the inclusion rate of unregistered apprenticeships is uncertain. We've included unregistered programs that are eligible for WIOA Individual Training Accounts, but there are likely to be many informal apprenticeships that we couldn't account for.

Additionally, we are largely unable to include company- or employer-provided training, both because there is no publicly available dataset that captures this type of training and because it is nearly impossible to confirm that this programming is targeted at the workers we focus on, who do not necessarily have a bachelor’s degree.

Our dataset may have inaccuracies due to self-reporting biases, as training providers and data collection entities may make specific choices (for example, reporting only data about organizations' headquarters) and mistakes (such as typos in addresses, self-attributing a different or unknown organization type, and so on) as they record this data in original data sources.

Finally, our dataset might include some providers that are no longer operational, as the data collection occurred during the COVID-19 pandemic, a time of significant change for the sector. While data from IPEDS and the IRS are updated annually, updates to the Department of Labor Registered Apprenticeships data rely on providers choosing to remove themselves. In a preliminary analysis, we found that some training providers are now defunct.