API Reference

This page provides an overview of the main modules and functions in wikipediaGATN.

Network Level Functions

Network-level BFS crawling of airport Wikipedia pages.

This module drives the breadth-first expansion of the airport network. Starting from a seed IATA code, it iteratively fetches each airport’s destinations (both passenger and cargo), saves the results as <CODE>.<level>.json files in TEMP_RESULTS_DIR/airports_rooted_sweep, and tracks progress in processed_locations.csv so that interrupted runs can be resumed.

Typical usage:

from wikipediaGATN.wikipedia_network_level import iterate_search_until_distance_N

# Crawl two hops out from Winnipeg
iterate_search_until_distance_N("YWG", dist=2, delay=0.5, verbose=True)

Functions

  • clean_output_directory – delete scraped files to start fresh

  • get_connections_level_N – expand one BFS level

  • check_processed_list – deduplicate / clean progress CSV

  • iterate_search_until_distance_N – crawl to a fixed depth

  • iterate_search_until_empty – crawl until no new airports are found

  • continue_existing_search_one_step – resume a partially-complete crawl by one step

  • continue_existing_search_until_empty – resume and run to completion

wikipediaGATN.wikipedia_network_level.check_processed_list(verbose: bool = False) -> None

Deduplicate and clean processed_locations.csv.

  • Exports rows with iata == "None" to failed_lookups.csv.

  • Removes those rows and any duplicate URLs from the main file.

  • Re-sorts by (iata, url).

Parameters:

verbose (bool, optional) – Print summary counts. Default: False.
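
The three clean-up steps listed for check_processed_list() can be sketched on in-memory rows. This is a minimal illustration, not the package's implementation: split_and_dedupe is a hypothetical helper, the rows are assumed to be dicts with iata and url columns, keeping the first occurrence of a duplicate URL is an assumption, and the sample URLs are placeholders.

```python
def split_and_dedupe(rows):
    """Return (kept, failed) from a list of {'iata', 'url'} dicts."""
    # Failed lookups are stored with the literal string "None" as their IATA code.
    failed = [row for row in rows if row["iata"] == "None"]
    kept, seen_urls = [], set()
    for row in rows:
        if row["iata"] == "None" or row["url"] in seen_urls:
            continue  # drop failed lookups and repeated URLs
        seen_urls.add(row["url"])
        kept.append(row)
    # Re-sort by (iata, url), as described for the main file.
    kept.sort(key=lambda row: (row["iata"], row["url"]))
    return kept, failed


# Placeholder rows for illustration only.
rows = [
    {"iata": "YYZ", "url": "https://en.wikipedia.org/wiki/Page_B"},
    {"iata": "None", "url": "https://en.wikipedia.org/wiki/Page_C"},
    {"iata": "YWG", "url": "https://en.wikipedia.org/wiki/Page_A"},
    {"iata": "YYZ", "url": "https://en.wikipedia.org/wiki/Page_B"},
]
kept, failed = split_and_dedupe(rows)
```

In this sketch the failed rows would go to failed_lookups.csv and the kept rows back to processed_locations.csv.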

wikipediaGATN.wikipedia_network_level.clean_output_directory(levels=None, verbose: bool = False) -> int

Delete scraped airport JSON files from TEMP_RESULTS_DIR.

Also removes processed_locations.csv so the next run starts fresh.

Parameters:
  • levels (list of int or None, optional) – If None (default), removes all .<N>.json files. If a list of integers is given, only files at those BFS levels are removed (e.g. levels=[2, 3]).

  • verbose (bool, optional) – Print a summary of what was removed. Default: False.

Returns:

Total number of JSON files removed.

Return type:

int
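
How the levels argument narrows the selection can be sketched with a filename filter. This is an illustration under the <CODE>.<level>.json naming scheme described on this page; files_to_remove is a hypothetical helper that only selects names and deletes nothing.

```python
import re

# Assumed naming scheme from this page: <CODE>.<level>.json
LEVEL_FILE = re.compile(r"^(?P<code>.+)\.(?P<level>\d+)\.json$")


def files_to_remove(names, levels=None):
    """Select file names whose BFS level is in `levels` (every level if None)."""
    selected = []
    for name in names:
        match = LEVEL_FILE.match(name)
        if match is None:
            continue  # not a scraped airport file
        if levels is None or int(match.group("level")) in levels:
            selected.append(name)
    return selected


names = ["YWG.0.json", "YYZ.1.json", "LHR.2.json", "CDG.3.json", "processed_locations.csv"]
all_levels = files_to_remove(names)
only_2_3 = files_to_remove(names, levels=[2, 3])
```

Here only_2_3 contains the level 2 and level 3 files, mirroring the levels=[2, 3] example in the parameter description.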

wikipediaGATN.wikipedia_network_level.continue_existing_search_one_step(delay: float = 1.0, verbose: bool = False) -> None

Resume a partially-complete crawl by processing one additional BFS step.

Finds the highest level N already present in TEMP_RESULTS_DIR and re-runs get_connections_level_N(from_length=N-1). Stepping back one level ensures the previous frontier is complete before advancing.

Parameters:
  • delay (float, optional) – Seconds between Wikipedia requests. Default: 1.0.

  • verbose (bool, optional) – Print progress. Default: False.
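
The level arithmetic used when resuming by one step can be sketched as follows. This is an illustration only: resume_from_length is a hypothetical helper, and clamping the result at 0 when only level-0 files exist is an assumption rather than documented behaviour.

```python
import re

# Matches the trailing ".<level>.json" of a scraped airport file.
LEVEL_SUFFIX = re.compile(r"\.(\d+)\.json$")


def resume_from_length(file_names):
    """Return the from_length to re-run, or None if no level files exist."""
    levels = []
    for name in file_names:
        match = LEVEL_SUFFIX.search(name)
        if match:
            levels.append(int(match.group(1)))
    if not levels:
        return None
    # Highest level N found; re-expand from N-1 (never below 0, by assumption).
    return max(max(levels) - 1, 0)


from_length = resume_from_length(["YWG.0.json", "YYZ.1.json", "LHR.2.json", "ORD.2.json"])
```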

wikipediaGATN.wikipedia_network_level.continue_existing_search_until_empty(delay: float = 1.0, verbose: bool = False) -> None

Resume a partially-complete crawl and run to completion.

Finds the highest BFS level N already present in TEMP_RESULTS_DIR and continues expanding from that point until no new airports are found.

Parameters:
  • delay (float, optional) – Seconds between Wikipedia requests. Default: 1.0.

  • verbose (bool, optional) – Print progress. Default: False.

Notes

Assumes the current highest level is already complete. If it is not, use continue_existing_search_one_step() first.

wikipediaGATN.wikipedia_network_level.get_connections_level_N(from_length: int = 0, delay: float = 1.0, verbose: bool = False) -> int

Expand the airport network by one BFS level.

For every airport file at level from_length (<CODE>.<from_length>.json), fetch each listed destination (both passenger and cargo) that has not yet been processed and save its data as <CODE>.<from_length+1>.json.

Parameters:
  • from_length (int, optional) – Source BFS level. Default: 0.

  • delay (float, optional) – Seconds to sleep between Wikipedia requests. Default: 1.0.

  • verbose (bool, optional) – Print per-destination progress. Default: False.

Returns:

Number of new destination files written.

Return type:

int
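
The one-level expansion performed by get_connections_level_N() can be sketched on an in-memory graph. This is a simplified illustration: expand_one_level is a hypothetical helper, the toy dictionary stands in for the scraped JSON files, and its routes are made up for the example.

```python
def expand_one_level(frontier, destinations_of, processed):
    """Return airports first reached from `frontier`, updating `processed` in place."""
    new_airports = []
    for airport in frontier:
        for dest in destinations_of.get(airport, []):
            if dest in processed:
                continue  # already fetched at this or an earlier level
            processed.add(dest)
            new_airports.append(dest)
    return new_airports


# Toy network standing in for scraped destination lists (illustrative only).
toy = {"YWG": ["YYZ", "YVR"], "YYZ": ["YWG", "LHR"], "YVR": ["YYZ", "NRT"]}
processed = {"YWG"}
level_1 = expand_one_level(["YWG"], toy, processed)
level_2 = expand_one_level(level_1, toy, processed)
```

The length of each returned list corresponds to the count of new destination files that the real function reports.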

wikipediaGATN.wikipedia_network_level.iterate_search_until_distance_N(seed_iata: str, dist: int = 1, delay: float = 1.0, verbose: bool = False) -> None

Crawl the airport network to a fixed BFS depth.

Parameters:
  • seed_iata (str) – IATA code of the starting airport (e.g. "YWG").

  • dist (int, optional) – Maximum BFS depth. dist=1 fetches only direct connections from the seed. Default: 1.

  • delay (float, optional) – Seconds to sleep between Wikipedia requests. Default: 1.0.

  • verbose (bool, optional) – Print per-airport progress. Default: False.

wikipediaGATN.wikipedia_network_level.iterate_search_until_empty(seed_iata: str, delay: float = 1.0, verbose: bool = False) -> None

Crawl the airport network until no new airports are discovered.

Parameters:
  • seed_iata (str) – IATA code of the starting airport.

  • delay (float, optional) – Seconds between Wikipedia requests. Default: 1.0.

  • verbose (bool, optional) – Print per-airport progress. Default: False.

Notes

For a global crawl this may run for many hours. Use iterate_search_until_distance_N() if you want a bounded run.
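
The difference between the bounded and unbounded crawls comes down to the loop's stopping rule. The sketch below shows both rules in one hypothetical function, crawl, over an in-memory toy graph; it performs no network requests and is not the package's code.

```python
def crawl(seed, destinations_of, max_dist=None):
    """Breadth-first crawl returning {airport: BFS level}."""
    level_of = {seed: 0}
    frontier, dist = [seed], 0
    # Stop when the frontier is empty, or when max_dist levels have been expanded.
    while frontier and (max_dist is None or dist < max_dist):
        dist += 1
        next_frontier = []
        for airport in frontier:
            for dest in destinations_of.get(airport, []):
                if dest not in level_of:
                    level_of[dest] = dist
                    next_frontier.append(dest)
        frontier = next_frontier
    return level_of


# Toy network (illustrative routes only).
toy = {"YWG": ["YYZ", "YVR"], "YYZ": ["YWG", "LHR"], "YVR": ["YYZ", "NRT"]}
bounded = crawl("YWG", toy, max_dist=1)
unbounded = crawl("YWG", toy)
```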

Result Processing

Orchestration functions for GATN (Global Air Transportation Networks) generation.

This module provides the complete pipeline to extract network structures from the parsed airport data. It drives the multi-pass IATA code recovery workflow, generates outbound connection lists, builds sparse adjacency matrices, and exports rich network graphs for both the Passenger (Pax) and Cargo networks.

wikipediaGATN.result_processing_network.run_two_pass_iata_extraction(batch_size: int = 50, delay: float = 0.5, verbose: bool = False) -> dict

Execute Passes 2 and 3 of the IATA recovery workflow.

Assumes Pass 1 (create_outbound_connections_list()) has already been run and unmapped_destinations.csv exists.

Pass 2

Attempts an instantaneous offline OurAirports lookup for each unmapped URL and falls back to fetching the Wikipedia page to extract the IATA code.

Pass 3

Filters successful extractions by confidence and writes manual_airport_mapping.csv for use in the next create_outbound_connections_list() call.

Parameters:
  • batch_size (int, optional) – Number of HTTP requests before a longer pause is inserted to respect Wikipedia’s servers. Default: 50.

  • delay (float, optional) – Per-request delay in seconds. Default: 0.5.

  • verbose (bool, optional) – If True, prints detailed per-URL progress. Default: False.

Returns:

{'extraction_result': dict, 'mapping_count': int}

extraction_result is the dict returned by extract_iata_from_unmapped_destinations() (keys: total, successful, skipped, failed, csv_path).

mapping_count is the number of entries written to manual_airport_mapping.csv.

Return type:

dict
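
The interaction between batch_size and delay can be sketched as a pause schedule. This is an illustration under stated assumptions: pause_schedule is a hypothetical helper, the 5-second batch_pause is an invented value (the page does not state the length of the longer pause), and the sketch assumes the longer pause replaces the per-request delay rather than adding to it.

```python
def pause_schedule(n_requests, batch_size=50, delay=0.5, batch_pause=5.0):
    """Seconds to sleep after each request; `batch_pause` is an assumed value."""
    pauses = []
    for i in range(1, n_requests + 1):
        # Every `batch_size`-th request is followed by the longer pause.
        pauses.append(batch_pause if i % batch_size == 0 else delay)
    return pauses


schedule = pause_schedule(120)
```

With the defaults, 120 requests produce two long pauses (after requests 50 and 100) and 118 short ones.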

Airport Level Functions

Airport-level Wikipedia scraping and data extraction functions.

Functions in this module interact with the Wikipedia API to fetch and parse airport pages, and supplement the extracted data using the authoritative OurAirports database.

Core Pipeline Functions:

  1. fetch_wikipedia_airport_link() – resolve an identifier to a URL

  2. fetch_wikipedia_airport_html() – fetch parsed HTML

  3. fetch_wikipedia_airport_wikitext() – fetch raw wikitext

  4. fetch_wikipedia_airlines() – set of airline names

  5. fetch_wikipedia_destinations() – set of (name, URL) tuples

  6. fetch_wikipedia_airlines_destinations() – airline → destinations map

  7. fetch_wikipedia_airport_info() – all metadata in one dict

  8. save_airport_info() – persist dict to JSON + progress CSV

OurAirports Integration & Validation:

  • infer_missing_geographic_data() – supplement missing data via OurAirports

  • compare_airports_with_ourairports() – audit extracted data against OurAirports

  • find_active_missing_airports() – identify unmapped airports in OurAirports

  • build_url_to_codes_map() – map Wikipedia URLs to IATA codes

Helper / Fallback functions:

wikipediaGATN.airport_level_functions.clean_infobox_value(value: str) -> str

Normalise a wikitext infobox value string.

  • {{nowrap|...}} -> inner content

  • {{Unbulleted list|...}} -> comma-separated items

  • {{URL|...}} -> bare URL

Parameters:

value (str) – Raw wikitext value.

Returns:

Cleaned string with wikilinks preserved as [[X]].

Return type:

str
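
The three template rewrites listed for clean_infobox_value() can be sketched with regular expressions. This is a rough illustration, not the package's implementation: clean_value_sketch is a hypothetical helper that handles only flat (non-nested) templates and does not cope with piped wikilinks inside an unbulleted list.

```python
import re


def clean_value_sketch(value):
    """Apply the three documented template rewrites to a wikitext value."""
    # {{nowrap|text}} -> text
    value = re.sub(r"\{\{\s*nowrap\s*\|([^{}]*)\}\}", r"\1", value, flags=re.IGNORECASE)
    # {{URL|example.org}} -> example.org (extra template arguments are dropped)
    value = re.sub(r"\{\{\s*URL\s*\|([^{}|]*)[^{}]*\}\}", r"\1", value, flags=re.IGNORECASE)

    # {{Unbulleted list|a|b}} -> "a, b"
    def join_items(match):
        items = [item.strip() for item in match.group(1).split("|")]
        return ", ".join(item for item in items if item)

    value = re.sub(
        r"\{\{\s*Unbulleted list\s*\|([^{}]*)\}\}", join_items, value, flags=re.IGNORECASE
    )
    return value.strip()


nowrap = clean_value_sketch("{{nowrap|Winnipeg, Manitoba}}")
url = clean_value_sketch("{{URL|example.org}}")
items = clean_value_sketch("{{Unbulleted list|[[Winnipeg]]|[[Manitoba]]}}")
```

Wikilinks pass through unchanged, which matches the documented return value ("wikilinks preserved as [[X]]").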

wikipediaGATN.airport_level_functions.fallback_fetch_wikipedia_airport_info(html_content: str) -> dict

Extract basic airport info from HTML when the infobox cannot be parsed.

Parameters:

html_content (str) – Parsed HTML of the airport Wikipedia page.

Returns:

Keys: iata, icao, serves, location, coordinates, wikipedia_url. Values are strings or None.

Return type:

dict

wikipediaGATN.airport_level_functions.fetch_wikipedia_airlines(identifier: str = 'YWG', link=None, html_content=None, verbose: bool = False, soup=None) -> set

Extract airline names from an airport’s Wikipedia page.

Parameters:
  • identifier (str, optional) – IATA/ICAO code, Wikipedia URL, or name. Default: "YWG".

  • link (str or None, optional) – Wikipedia page URL (fetched automatically if absent).

  • html_content (str or None, optional) – Pre-fetched HTML (fetched automatically if absent).

  • verbose (bool, optional) – Print progress. Default: False.

  • soup (BeautifulSoup or None, optional) – Pre-parsed BeautifulSoup object. Default: None.

Returns:

Airline names extracted from the Airlines and destinations table.

Return type:

set of str

wikipediaGATN.airport_level_functions.fetch_wikipedia_airlines_destinations(identifier: str = 'YWG', link=None, html_content=None, verbose: bool = False, soup=None) -> dict

Extract an airline-to-destinations mapping from an airport’s Wikipedia page.

Falls back to parse_fallback_nlp_airlines_destinations() if no table-based data is found.

Parameters:
  • identifier (str, optional) – IATA/ICAO code, Wikipedia URL, or name. Default: "YWG".

  • link (str or None, optional) – Wikipedia page URL (fetched automatically if absent).

  • html_content (str or None, optional) – Pre-fetched HTML (fetched automatically if absent).

  • verbose (bool, optional) – Print progress. Default: False.

  • soup (BeautifulSoup or None, optional) – Pre-parsed BeautifulSoup object. Default: None.

Returns:

{"passenger": {airline_name: {destination_name, ...}, ...}, "cargo": {...}}

Return type:

dict

wikipediaGATN.airport_level_functions.fetch_wikipedia_airport_html(link: str, verbose: bool = False)

Fetch the parsed HTML content of a Wikipedia page.

Parameters:
  • link (str) – Wikipedia page URL (https://en.wikipedia.org/wiki/<Title>).

  • verbose (bool, optional) – Print progress messages. Default: False.

Returns:

HTML string, or None on failure.

Return type:

str or None

wikipediaGATN.airport_level_functions.fetch_wikipedia_airport_info(identifier: str = 'YWG', link=None, verbose: bool = False) -> dict

Extract all available metadata for an airport from its Wikipedia page.

Parameters:
  • identifier (str, optional) – IATA/ICAO code, Wikipedia URL, or name. Default: "YWG".

  • link (str or None, optional) – Wikipedia page URL (fetched automatically if absent).

  • verbose (bool, optional) – Print progress. Default: False.

Returns:

Keys: iata, icao, city-served, location, lat, lon, altitude, region, country_alpha3, country_name, subdivision_code, wikipedia_url, airlines, destinations, airlines_destinations.

Return type:

dict

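wikipediaGATN.airport_level_functions.fetch_wikipedia_airport_link(identifier: str, verbose: bool = False)

Resolve an airport identifier to its Wikipedia page URL.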

Resolution order:

  1. If identifier is already a Wikipedia URL, the title is decoded from it.

  2. If identifier matches [A-Za-z]{3} or [A-Za-z]{4} (IATA/ICAO), search for "<CODE> airport".

  3. Otherwise, treat identifier as a free-text name; append " airport" if the word is not already present.

Parameters:
  • identifier (str) – IATA code, ICAO code, Wikipedia URL, or free-text airport name.

  • verbose (bool, optional) – Print search term and result. Default: False.

Returns:

Canonical Wikipedia URL, or None if no page was found.

Return type:

str or None
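
The resolution order above can be sketched as a small classifier that decides what to look up. This is an illustration only: search_term_for is a hypothetical helper, it stops short of querying Wikipedia, and upper-casing the code before searching is an assumption.

```python
import re
from urllib.parse import unquote, urlparse


def search_term_for(identifier):
    """Return ('title', page_title) or ('search', query) following the documented order."""
    parsed = urlparse(identifier)
    if parsed.netloc.endswith("wikipedia.org") and parsed.path.startswith("/wiki/"):
        # 1. Already a Wikipedia URL: decode the title from the path.
        title = unquote(parsed.path[len("/wiki/"):]).replace("_", " ")
        return ("title", title)
    if re.fullmatch(r"[A-Za-z]{3}|[A-Za-z]{4}", identifier):
        # 2. Three or four letters: treat as an IATA/ICAO code.
        return ("search", f"{identifier.upper()} airport")
    # 3. Free-text name: append " airport" only if the word is missing.
    if "airport" not in identifier.lower():
        identifier = f"{identifier} airport"
    return ("search", identifier)


from_url = search_term_for("https://en.wikipedia.org/wiki/Heathrow_Airport")
from_code = search_term_for("ywg")
from_name = search_term_for("Winnipeg")
```

Note that rule 2 applies to any three- or four-letter string, so a short city name would be treated as a code under this rule.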

wikipediaGATN.airport_level_functions.fetch_wikipedia_airport_wikitext(link: str, verbose: bool = False)

Fetch the raw wikitext source of a Wikipedia page.

Parameters:
  • link (str) – Wikipedia page URL.

  • verbose (bool, optional) – Print progress messages. Default: False.

Returns:

Wikitext string, or None on failure.

Return type:

str or None

wikipediaGATN.airport_level_functions.fetch_wikipedia_destinations(identifier: str = 'YWG', link=None, html_content=None, verbose: bool = False, soup=None) -> set

Extract destination (name, Wikipedia URL) pairs from an airport page.

Parameters:
  • identifier (str, optional) – IATA/ICAO code, Wikipedia URL, or name. Default: "YWG".

  • link (str or None, optional) – Wikipedia page URL (fetched automatically if absent).

  • html_content (str or None, optional) – Pre-fetched HTML (fetched automatically if absent).

  • verbose (bool, optional) – Print progress. Default: False.

  • soup (BeautifulSoup or None, optional) – Pre-parsed BeautifulSoup object. Default: None.

Returns:

(destination_name, wikipedia_url) pairs.

Return type:

set of tuple[str, str]

wikipediaGATN.airport_level_functions.format_airport_json(data: dict) -> dict

Enforce a strict ordering of JSON keys for airport data output.
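
One way to enforce a key order is to rebuild the dict with preferred keys first. The sketch below is an assumption-laden illustration: order_keys is a hypothetical helper, KEY_ORDER borrows a few keys from the list documented for fetch_wikipedia_airport_info(), and placing unknown keys last in alphabetical order is a guess; the package's actual ordering is not stated on this page.

```python
# Assumed preferred order; the package's real order may differ.
KEY_ORDER = ["iata", "icao", "city-served", "location", "lat", "lon"]


def order_keys(data, key_order=KEY_ORDER):
    """Return a new dict with known keys first, then the rest alphabetically."""
    ordered = {key: data[key] for key in key_order if key in data}
    for key in sorted(data):
        if key not in ordered:
            ordered[key] = data[key]
    return ordered


# Illustrative values only.
example = order_keys({"lon": "-97.24", "zeta": 1, "iata": "YWG", "icao": "CYWG"})
```

Because Python dicts preserve insertion order, json.dumps(example) would emit the keys in this order.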

wikipediaGATN.airport_level_functions.parse_fallback_nlp_airlines_destinations(html_content: str, verbose: bool = False, soup=None) -> set

Use spaCy NER to extract (airline, destination) pairs as a last resort.

Parameters:
  • html_content (str) – Parsed HTML of the airport Wikipedia page.

  • verbose (bool, optional) – Print match counts. Default: False.

  • soup (BeautifulSoup or None, optional) – Pre-parsed BeautifulSoup object. Default: None.

Returns:

(ORG entity, GPE entity) pairs.

Return type:

set of tuple[str, str]

Notes

Requires en_core_web_sm. Install with:

python -m spacy download en_core_web_sm

Returns an empty set if the model is not available.

wikipediaGATN.airport_level_functions.parse_infobox_from_wikitext(wikitext: str, verbose: bool = False) -> dict

Parse the {{Infobox airport}} template from wikitext into a dict.

Parameters:
  • wikitext (str) – Full wikitext source of the airport Wikipedia page.

  • verbose (bool, optional) – Print parsed key list. Default: False.

Returns:

{field_name: cleaned_value, ...} plus derived lat, lon, region, and ISO 3166-2 country fields when available. Returns {} if no infobox is found.

Return type:

dict

wikipediaGATN.airport_level_functions.parse_iso3166_2(region_code: str)

Resolve an ISO 3166-2 region code to country and subdivision details.

Parameters:

region_code (str) – Code in the form "CC-SUB" (e.g. "CA-MB").

Returns:

{'country_alpha3': str, 'country_name': str, 'subdivision_code': str} or None if region_code is invalid or the country is not found.

Return type:

dict or None
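
The "CC-SUB" split and the None cases can be sketched with a tiny lookup table. This is an illustration only: parse_region_sketch and the two-entry COUNTRIES table are hypothetical (a real implementation would need a full ISO 3166 dataset), and returning only the part after the hyphen as subdivision_code is an assumption.

```python
# Tiny illustrative lookup: alpha-2 code -> (alpha-3 code, country name).
COUNTRIES = {"CA": ("CAN", "Canada"), "US": ("USA", "United States")}


def parse_region_sketch(region_code):
    """Split 'CC-SUB' into country and subdivision details, or return None."""
    if not isinstance(region_code, str) or "-" not in region_code:
        return None  # not in the "CC-SUB" form
    alpha2, _, subdivision = region_code.partition("-")
    country = COUNTRIES.get(alpha2.upper())
    if country is None or not subdivision:
        return None  # unknown country or empty subdivision
    return {
        "country_alpha3": country[0],
        "country_name": country[1],
        "subdivision_code": subdivision,
    }


manitoba = parse_region_sketch("CA-MB")
```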

wikipediaGATN.airport_level_functions.parse_lat_lon_from_string(coord_string: str)

Parse a coordinate string into decimal-degree latitude and longitude.

Parameters:

coord_string (str) – Any format accepted by geopy (DMS, decimal, etc.).

Returns:

(latitude, longitude) as 6-decimal-place strings, or ("", "") on parse failure.

Return type:

tuple[str, str]
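
The return contract (6-decimal strings, or a pair of empty strings on failure) can be sketched without geopy. The hypothetical parse_decimal_pair below handles only the plain "lat, lon" decimal format using the standard library; it is a stand-in to show the output shape, not the geopy-based parsing the function is documented to use.

```python
def parse_decimal_pair(coord_string):
    """Parse 'lat, lon' in decimal degrees into 6-decimal-place strings."""
    try:
        lat_text, lon_text = coord_string.split(",")
        lat, lon = float(lat_text), float(lon_text)
    except (AttributeError, ValueError):
        return ("", "")  # not a string, wrong number of parts, or not numeric
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return ("", "")  # out of range (also rejects NaN)
    return (f"{lat:.6f}", f"{lon:.6f}")


ok = parse_decimal_pair("49.91, -97.2399")
bad = parse_decimal_pair("not a coordinate")
```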

wikipediaGATN.airport_level_functions.parse_wikitext_airlines_destinations(wikitext: str) -> dict

Extract airline-to-destinations data from {{Airport-dest-list}} templates.

Only destinations expressed as Wikipedia wikilinks are included.

Parameters:

wikitext (str) – Full wikitext of the airport Wikipedia page.

Returns:

{"passenger": {airline_name: [{"name": str, "wikipedia_url": str}, ...], ...}, "cargo": {...}}

Return type:

dict
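
Turning wikilinks into the {"name", "wikipedia_url"} entries shown above can be sketched with a regular expression. This is an illustration, not the package's parser: wikilinks_to_destinations is a hypothetical helper, using the link label as the name is an assumption, and the URL is built by simple underscore substitution without percent-encoding.

```python
import re

# [[Target]] or [[Target|Label]]
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def wikilinks_to_destinations(wikitext):
    """Return [{'name', 'wikipedia_url'}, ...] for each wikilink found."""
    destinations = []
    for target, label in WIKILINK.findall(wikitext):
        title = target.strip()
        destinations.append({
            "name": (label or title).strip(),
            "wikipedia_url": "https://en.wikipedia.org/wiki/" + title.replace(" ", "_"),
        })
    return destinations


cell = "[[Toronto Pearson International Airport|Toronto]], Vancouver, [[Calgary International Airport]]"
found = wikilinks_to_destinations(cell)
```

The plain-text "Vancouver" is skipped, consistent with the note that only destinations expressed as wikilinks are included.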

wikipediaGATN.airport_level_functions.save_airport_info(airport_info: dict, level: int = 0, verbose: bool = False, save_progress: bool = True, iata_from: str = '') -> str

Persist an airport info dictionary to TEMP_RESULTS_DIR/<CODE>.<level>.json.

Parameters:
  • airport_info (dict) – Dict as returned by fetch_wikipedia_airport_info().

  • level (int, optional) – BFS distance level from the seed airport. Default: 0.

  • verbose (bool, optional) – Print the saved path. Default: False.

  • save_progress (bool, optional) – Append to processed_locations.csv. Default: True.

Returns:

The IATA code (or wiki_<title> / "unknown") used as the filename prefix.

Return type:

str
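
The choice of filename prefix described in the return value can be sketched as a three-way fallback. This is an illustration only: filename_prefix is a hypothetical helper, and deriving <title> from the last part of wikipedia_url is an assumption about how the wiki_<title> form is built.

```python
from urllib.parse import unquote


def filename_prefix(airport_info):
    """Pick the <CODE> part of <CODE>.<level>.json."""
    iata = airport_info.get("iata")
    if iata and iata != "None":
        return iata  # normal case: the IATA code
    url = airport_info.get("wikipedia_url") or ""
    if "/wiki/" in url:
        title = unquote(url.split("/wiki/", 1)[1])
        if title:
            return f"wiki_{title}"  # no IATA code, fall back to the page title
    return "unknown"


# Placeholder inputs for illustration.
with_iata = filename_prefix({"iata": "YWG"})
with_title = filename_prefix({"iata": None, "wikipedia_url": "https://en.wikipedia.org/wiki/Some_Airfield"})
file_name = f"{with_iata}.{0}.json"
```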