API Reference
This page provides an overview of the main modules and functions in wikipediaGATN.
Network Level Functions
Network-level BFS crawling of airport Wikipedia pages.
This module drives the breadth-first expansion of the airport network.
Starting from a seed IATA code, it iteratively fetches each airport’s
destinations (both passenger and cargo), saves the results as
<CODE>.<level>.json files in TEMP_RESULTS_DIR/airports_rooted_sweep,
and tracks progress in processed_locations.csv so that interrupted runs
can be resumed.
Typical usage:
from wikipediaGATN.wikipedia_network_level import iterate_search_until_distance_N
# Crawl two hops out from Winnipeg
iterate_search_until_distance_N("YWG", dist=2, delay=0.5, verbose=True)
Functions
clean_output_directory: delete scraped files to start fresh
get_connections_level_N: expand one BFS level
check_processed_list: deduplicate / clean progress CSV
iterate_search_until_distance_N: crawl to a fixed depth
iterate_search_until_empty: crawl until no new airports are found
continue_existing_search_one_step: resume a partially-complete crawl by one step
continue_existing_search_until_empty: resume and run to completion
- wikipediaGATN.wikipedia_network_level.check_processed_list(verbose: bool = False) → None
Deduplicate and clean processed_locations.csv.
Exports rows with iata == "None" to failed_lookups.csv.
Removes those rows and any duplicate URLs from the main file.
Re-sorts by (iata, url).
- Parameters:
verbose (bool, optional) – Print summary counts. Default: False.
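The three cleanup steps above can be illustrated with a minimal, standard-library sketch. It is not the package's implementation: the helper name clean_progress_rows is hypothetical, the sample rows are made up, and the only column names assumed are iata and url (taken from the sort key mentioned above).

```python
import csv
import io


def clean_progress_rows(rows):
    """Hypothetical helper: split off failed lookups, drop duplicate URLs, sort."""
    # Rows whose IATA lookup failed are stored with the literal string "None".
    failed = [row for row in rows if row["iata"] == "None"]
    seen_urls = set()
    kept = []
    for row in rows:
        if row["iata"] == "None" or row["url"] in seen_urls:
            continue  # skip failed lookups and repeated URLs
        seen_urls.add(row["url"])
        kept.append(row)
    # Re-sort by (iata, url), as described in the reference entry.
    kept.sort(key=lambda row: (row["iata"], row["url"]))
    return kept, failed


# Made-up sample data standing in for processed_locations.csv.
sample = io.StringIO(
    "iata,url\n"
    "YYZ,https://en.wikipedia.org/wiki/Toronto_Pearson_International_Airport\n"
    "None,https://en.wikipedia.org/wiki/Example_Airfield\n"
    "YWG,https://en.wikipedia.org/wiki/Winnipeg_International_Airport\n"
    "YYZ,https://en.wikipedia.org/wiki/Toronto_Pearson_International_Airport\n"
)
kept, failed = clean_progress_rows(list(csv.DictReader(sample)))
```

In this sketch kept would be written back to the main file and failed would go to failed_lookups.csv.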
- wikipediaGATN.wikipedia_network_level.clean_output_directory(levels=None, verbose: bool = False) → int
Delete scraped airport JSON files from TEMP_RESULTS_DIR.
Also removes processed_locations.csv so the next run starts fresh.
- Parameters:
levels (list of int or None, optional) – If None (default), removes all .<N>.json files. If a list of integers is given, only files at those BFS levels are removed (e.g. levels=[2, 3]).
verbose (bool, optional) – Print a summary of what was removed. Default: False.
- Returns:
Total number of JSON files removed.
- Return type:
int
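As a rough illustration of the levels filter, the sketch below deletes <CODE>.<N>.json files from a temporary directory. The function name remove_level_files and the filename regex are assumptions made for this example; the package's own matching logic may differ, and the progress CSV is not handled here.

```python
import re
import tempfile
from pathlib import Path

# Assumed filename shape: <CODE>.<level>.json
LEVEL_FILE = re.compile(r"^(?P<code>.+)\.(?P<level>\d+)\.json$")


def remove_level_files(directory, levels=None):
    """Hypothetical helper: delete level files, optionally only at given levels."""
    removed = 0
    # Materialise the listing first so deletion does not disturb iteration.
    for path in sorted(Path(directory).glob("*.json")):
        match = LEVEL_FILE.match(path.name)
        if match is None:
            continue  # not a crawl output file
        if levels is None or int(match.group("level")) in levels:
            path.unlink()
            removed += 1
    return removed


with tempfile.TemporaryDirectory() as tmp:
    for name in ("YWG.0.json", "YYZ.1.json", "YVR.1.json", "LHR.2.json"):
        (Path(tmp) / name).write_text("{}")
    removed_count = remove_level_files(tmp, levels=[1])
    remaining = sorted(p.name for p in Path(tmp).iterdir())
```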
- wikipediaGATN.wikipedia_network_level.continue_existing_search_one_step(delay: float = 1.0, verbose: bool = False) → None
Resume a partially-complete crawl by processing one additional BFS step.
Finds the highest level N already present in TEMP_RESULTS_DIR and re-runs get_connections_level_N(from_length=N-1). Stepping back one level ensures the previous frontier is complete before advancing.
- Parameters:
delay (float, optional) – Seconds between Wikipedia requests. Default: 1.0.
verbose (bool, optional) – Print progress. Default: False.
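The step-back rule can be sketched as follows. Both helper names are hypothetical, and clamping the result at 0 when only level-0 files exist is an assumption of this example, not documented behaviour.

```python
import re

# Assumed filename shape: <CODE>.<level>.json
LEVEL_FILE = re.compile(r"^.+\.(\d+)\.json$")


def highest_level(filenames):
    """Return the largest BFS level found in the filenames, or None."""
    levels = [
        int(m.group(1))
        for name in filenames
        if (m := LEVEL_FILE.match(name))
    ]
    return max(levels) if levels else None


def resume_from_level(filenames):
    """Hypothetical helper: pick from_length = N - 1 for the re-run."""
    top = highest_level(filenames)
    if top is None:
        return None  # nothing crawled yet
    # Assumption: never go below level 0.
    return max(top - 1, 0)


files = ["YWG.0.json", "YYZ.1.json", "LHR.2.json", "processed_locations.csv"]
top_level = highest_level(files)
from_length = resume_from_level(files)
```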
- wikipediaGATN.wikipedia_network_level.continue_existing_search_until_empty(delay: float = 1.0, verbose: bool = False) → None
Resume a partially-complete crawl and run to completion.
Finds the highest BFS level N already present in TEMP_RESULTS_DIR and continues expanding from that point until no new airports are found.
- Parameters:
delay (float, optional) – Seconds between Wikipedia requests. Default: 1.0.
verbose (bool, optional) – Print progress. Default: False.
Notes
Assumes the current highest level is already complete. If it is not, use continue_existing_search_one_step() first.
- wikipediaGATN.wikipedia_network_level.get_connections_level_N(from_length: int = 0, delay: float = 1.0, verbose: bool = False) → int
Expand the airport network by one BFS level.
For every airport file at level from_length (<CODE>.<from_length>.json), fetch each listed destination (both passenger and cargo) that has not yet been processed and save its data as <CODE>.<from_length+1>.json.
- Parameters:
from_length (int, optional) – Source BFS level. Default: 0.
delay (float, optional) – Seconds to sleep between Wikipedia requests. Default: 1.0.
verbose (bool, optional) – Print per-destination progress. Default: False.
- Returns:
Number of new destination files written.
- Return type:
int
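A single-level expansion can be pictured with an in-memory stand-in for the JSON files. In this sketch expand_one_level and the routes dictionary are hypothetical, and no Wikipedia requests are made; it only shows how passenger and cargo destinations are merged and how already-processed airports are skipped.

```python
def expand_one_level(frontier, routes, processed):
    """Hypothetical helper: return codes first discovered from the frontier."""
    new_codes = []
    for code in frontier:
        listed = routes.get(code, {})
        # Passenger and cargo destinations are treated alike.
        for dest in listed.get("passenger", []) + listed.get("cargo", []):
            if dest in processed:
                continue  # already fetched at this or an earlier level
            processed.add(dest)
            new_codes.append(dest)
    return new_codes


# Made-up toy data standing in for a level-0 file.
routes = {
    "YWG": {"passenger": ["YYZ", "YVR"], "cargo": ["YHM", "YYZ"]},
}
processed = {"YWG"}
level_1 = expand_one_level(["YWG"], routes, processed)
new_file_count = len(level_1)
```

The length of the returned list plays the role of the "number of new destination files written" in the entry above.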
- wikipediaGATN.wikipedia_network_level.iterate_search_until_distance_N(seed_iata: str, dist: int = 1, delay: float = 1.0, verbose: bool = False) → None
Crawl the airport network to a fixed BFS depth.
- Parameters:
seed_iata (str) – IATA code of the starting airport (e.g. "YWG").
dist (int, optional) – Maximum BFS depth. dist=1 fetches only direct connections from the seed. Default: 1.
delay (float, optional) – Seconds to sleep between Wikipedia requests. Default: 1.0.
verbose (bool, optional) – Print per-airport progress. Default: False.
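To make the meaning of dist concrete, here is a small depth-bounded BFS over a made-up adjacency map. The function crawl_to_depth and the toy network are illustrative assumptions; the real function writes files and returns None rather than a dictionary.

```python
def crawl_to_depth(seed, neighbours, dist=1):
    """Hypothetical helper: group reachable codes by BFS level, up to dist."""
    levels = {0: [seed]}
    seen = {seed}
    for depth in range(dist):
        next_level = []
        for code in levels[depth]:
            for dest in neighbours.get(code, []):
                if dest not in seen:
                    seen.add(dest)
                    next_level.append(dest)
        if not next_level:
            break  # nothing new to expand
        levels[depth + 1] = next_level
    return levels


# Made-up toy network, not real route data.
toy = {
    "YWG": ["YYZ", "YVR"],
    "YYZ": ["LHR", "YWG"],
    "YVR": ["NRT"],
    "LHR": ["JFK"],
}
one_hop = crawl_to_depth("YWG", toy, dist=1)
two_hops = crawl_to_depth("YWG", toy, dist=2)
```

With dist=1 only the seed's direct connections appear; dist=2 adds the airports reachable from those.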
- wikipediaGATN.wikipedia_network_level.iterate_search_until_empty(seed_iata: str, delay: float = 1.0, verbose: bool = False) → None
Crawl the airport network until no new airports are discovered.
- Parameters:
seed_iata (str) – IATA code of the starting airport.
delay (float, optional) – Seconds between Wikipedia requests. Default: 1.0.
verbose (bool, optional) – Print per-airport progress. Default: False.
Notes
For a global crawl this may run for many hours. Use iterate_search_until_distance_N() if you want a bounded run.
Result Processing
Orchestration functions for GATN (Global Air Transportation Networks) generation.
This module provides the complete pipeline to extract network structures from the parsed airport data. It drives the multi-pass IATA code recovery workflow, generates outbound connection lists, builds sparse adjacency matrices, and exports rich network graphs for both the Passenger (Pax) and Cargo networks.
- wikipediaGATN.result_processing_network.run_two_pass_iata_extraction(batch_size: int = 50, delay: float = 0.5, verbose: bool = False) → dict
Execute Passes 2 and 3 of the IATA recovery workflow.
Assumes Pass 1 (create_outbound_connections_list()) has already been run and unmapped_destinations.csv exists.
- Pass 2
Attempts an instantaneous offline OurAirports lookup for each unmapped URL, falling back to fetching the Wikipedia page to extract the IATA code.
- Pass 3
Filters successful extractions by confidence and writes manual_airport_mapping.csv for use in the next create_outbound_connections_list() call.
- Parameters:
batch_size (int, optional) – Number of HTTP requests before a longer pause is inserted to respect Wikipedia’s servers. Default: 50.
delay (float, optional) – Per-request delay in seconds. Default: 0.5.
verbose (bool, optional) – If True, prints detailed per-URL progress. Default: False.
- Returns:
{'extraction_result': dict, 'mapping_count': int}
extraction_result is the dict returned by extract_iata_from_unmapped_destinations() (keys: total, successful, skipped, failed, csv_path).
mapping_count is the number of entries written to manual_airport_mapping.csv.
- Return type:
dict
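The interplay of delay and batch_size can be sketched with a small generator. The name throttled and the batch_pause parameter are hypothetical: the reference above does not state how long the longer pause is, so the 5.0 second default here is purely an assumption. The sleep function is injectable so the pacing can be inspected without waiting.

```python
import time


def throttled(items, batch_size=50, delay=0.5, batch_pause=5.0, sleep=time.sleep):
    """Hypothetical helper: yield items, pausing longer after each full batch."""
    for index, item in enumerate(items, start=1):
        yield item
        if index % batch_size == 0:
            sleep(batch_pause)  # assumed longer pause after a full batch
        else:
            sleep(delay)  # ordinary per-request delay


# Record the pauses instead of sleeping, to show the pattern.
pauses = []
handled = list(
    throttled(range(5), batch_size=2, delay=0.1, batch_pause=1.0, sleep=pauses.append)
)
```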
Airport Level Functions
Airport-level Wikipedia scraping and data extraction functions.
Functions in this module interact with the Wikipedia API to fetch and parse airport pages, and supplement the extracted data using the authoritative OurAirports database.
Core Pipeline Functions:
fetch_wikipedia_airport_link(): resolve an identifier to a URL
fetch_wikipedia_airport_html(): fetch parsed HTML
fetch_wikipedia_airport_wikitext(): fetch raw wikitext
fetch_wikipedia_airlines(): set of airline names
fetch_wikipedia_destinations(): set of (name, URL) tuples
fetch_wikipedia_airlines_destinations(): airline → destinations map
fetch_wikipedia_airport_info(): all metadata in one dict
save_airport_info(): persist dict to JSON + progress CSV
OurAirports Integration & Validation:
infer_missing_geographic_data(): supplement missing data via OurAirports
compare_airports_with_ourairports(): audit extracted data against OurAirports
find_active_missing_airports(): identify unmapped airports in OurAirports
build_url_to_codes_map(): map Wikipedia URLs to IATA codes
Helper / Fallback functions:
format_destinations_list()
- wikipediaGATN.airport_level_functions.clean_infobox_value(value: str) → str
Normalise a wikitext infobox value string.
{{nowrap|...}} -> inner content
{{Unbulleted list|...}} -> comma-separated items
{{URL|...}} -> bare URL
- Parameters:
value (str) – Raw wikitext value.
- Returns:
Cleaned string with wikilinks preserved as [[X]].
- Return type:
str
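The three template rewrites listed above might be approximated with regular expressions as in the sketch below. This is an illustrative stand-in named clean_infobox_value_sketch, not the package's code; it assumes templates are not nested and handles only the three cases named in the entry.

```python
import re


def clean_infobox_value_sketch(value: str) -> str:
    """Illustrative approximation of the three documented rewrites."""

    def join_items(match):
        # Split on pipes that are not inside a [[target|label]] wikilink.
        items = re.split(r"\|(?![^\[]*\]\])", match.group(1))
        return ", ".join(item.strip() for item in items if item.strip())

    # {{nowrap|...}} -> inner content
    value = re.sub(
        r"\{\{\s*nowrap\s*\|([^{}]*)\}\}", r"\1", value, flags=re.IGNORECASE
    )
    # {{Unbulleted list|a|b}} -> "a, b"
    value = re.sub(
        r"\{\{\s*unbulleted list\s*\|([^{}]*)\}\}",
        join_items,
        value,
        flags=re.IGNORECASE,
    )
    # {{URL|example.com}} -> bare URL
    value = re.sub(
        r"\{\{\s*URL\s*\|([^{}|]*)\}\}", r"\1", value, flags=re.IGNORECASE
    )
    return value.strip()


nowrap = clean_infobox_value_sketch("{{nowrap|[[Winnipeg]], [[Manitoba]]}}")
listed = clean_infobox_value_sketch("{{Unbulleted list|[[A]]|[[B|Bee]]|C}}")
url = clean_infobox_value_sketch("{{URL|example.com}}")
```

Wikilinks pass through untouched, matching the "preserved as [[X]]" note in the return description.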
- wikipediaGATN.airport_level_functions.fallback_fetch_wikipedia_airport_info(html_content: str) → dict
Extract basic airport info from HTML when the infobox cannot be parsed.
- Parameters:
html_content (str) – Parsed HTML of the airport Wikipedia page.
- Returns:
Keys: iata, icao, serves, location, coordinates, wikipedia_url. Values are strings or None.
- Return type:
dict
- wikipediaGATN.airport_level_functions.fetch_wikipedia_airlines(identifier: str = 'YWG', link=None, html_content=None, verbose: bool = False, soup=None) → set
Extract airline names from an airport’s Wikipedia page.
- Parameters:
identifier (str, optional) – IATA/ICAO code, Wikipedia URL, or name. Default: "YWG".
link (str or None, optional) – Wikipedia page URL (fetched automatically if absent).
html_content (str or None, optional) – Pre-fetched HTML (fetched automatically if absent).
verbose (bool, optional) – Print progress. Default: False.
soup (BeautifulSoup or None, optional) – Pre-parsed BeautifulSoup object. Default: None.
- Returns:
Airline names extracted from the Airlines and destinations table.
- Return type:
set of str
- wikipediaGATN.airport_level_functions.fetch_wikipedia_airlines_destinations(identifier: str = 'YWG', link=None, html_content=None, verbose: bool = False, soup=None) → dict
Extract an airline to destinations mapping from an airport’s Wikipedia page.
Falls back to parse_fallback_nlp_airlines_destinations() if no table-based data is found.
- Parameters:
identifier (str, optional) – IATA/ICAO code, Wikipedia URL, or name. Default: "YWG".
link (str or None, optional) – Wikipedia page URL (fetched automatically if absent).
html_content (str or None, optional) – Pre-fetched HTML (fetched automatically if absent).
verbose (bool, optional) – Print progress. Default: False.
soup (BeautifulSoup or None, optional) – Pre-parsed BeautifulSoup object. Default: None.
- Returns:
{"passenger": {airline_name: {destination_name, ...}, ...}, "cargo": {...}}
- Return type:
dict
- wikipediaGATN.airport_level_functions.fetch_wikipedia_airport_html(link: str, verbose: bool = False)
Fetch the parsed HTML content of a Wikipedia page.
- Parameters:
link (str) – Wikipedia page URL (https://en.wikipedia.org/wiki/<Title>).
verbose (bool, optional) – Print progress messages. Default: False.
- Returns:
HTML string, or None on failure.
- Return type:
str or None
- wikipediaGATN.airport_level_functions.fetch_wikipedia_airport_info(identifier: str = 'YWG', link=None, verbose: bool = False) → dict
Extract all available metadata for an airport from its Wikipedia page.
- Parameters:
identifier (str, optional) – IATA/ICAO code, Wikipedia URL, or name. Default: "YWG".
link (str or None, optional) – Wikipedia page URL (fetched automatically if absent).
verbose (bool, optional) – Print progress. Default: False.
- Returns:
Keys: iata, icao, city-served, location, lat, lon, altitude, region, country_alpha3, country_name, subdivision_code, wikipedia_url, airlines, destinations, airlines_destinations.
- Return type:
dict
- wikipediaGATN.airport_level_functions.fetch_wikipedia_airport_link(identifier: str, verbose: bool = False)
Resolve an airport identifier to its Wikipedia page URL.
Resolution order:
1. If identifier is already a Wikipedia URL, the title is decoded from it.
2. If identifier matches [A-Za-z]{3} or [A-Za-z]{4} (IATA/ICAO), search for "<CODE> airport".
3. Otherwise, treat identifier as a free-text name; append " airport" if the word is not already present.
- Parameters:
identifier (str) – IATA code, ICAO code, Wikipedia URL, or free-text airport name.
verbose (bool, optional) – Print search term and result. Default: False.
- Returns:
Canonical Wikipedia URL, or None if no page was found.
- Return type:
str or None
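The resolution order can be mirrored in a small classifier that decides which branch an identifier falls into and what title or search term results. The function classify_identifier is hypothetical and performs no search; the URL test (a wikipedia.org host with a /wiki/ path) and the case-insensitive check for the word "airport" are assumptions of this sketch.

```python
import re
from urllib.parse import unquote, urlparse


def classify_identifier(identifier):
    """Hypothetical helper: return (kind, title_or_search_term)."""
    ident = identifier.strip()
    parsed = urlparse(ident)
    # Branch 1: already a Wikipedia URL, so decode the title from the path.
    if parsed.netloc.endswith("wikipedia.org") and parsed.path.startswith("/wiki/"):
        title = unquote(parsed.path[len("/wiki/"):]).replace("_", " ")
        return ("url", title)
    # Branch 2: three or four letters, treated as an IATA/ICAO code.
    if re.fullmatch(r"[A-Za-z]{3}|[A-Za-z]{4}", ident):
        return ("code", f"{ident} airport")
    # Branch 3: free text; append " airport" only if the word is missing.
    if "airport" not in ident.lower():
        ident = f"{ident} airport"
    return ("name", ident)


from_url = classify_identifier("https://en.wikipedia.org/wiki/Z%C3%BCrich_Airport")
from_code = classify_identifier("YWG")
from_name = classify_identifier("Winnipeg international")
already_named = classify_identifier("Heathrow Airport")
```

One consequence of the documented pattern is that any three- or four-letter name (for example a short city name) takes the code branch rather than the free-text branch.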
- wikipediaGATN.airport_level_functions.fetch_wikipedia_airport_wikitext(link: str, verbose: bool = False)
Fetch the raw wikitext source of a Wikipedia page.
- Parameters:
link (str) – Wikipedia page URL.
verbose (bool, optional) – Print progress messages. Default: False.
- Returns:
Wikitext string, or None on failure.
- Return type:
str or None
- wikipediaGATN.airport_level_functions.fetch_wikipedia_destinations(identifier: str = 'YWG', link=None, html_content=None, verbose: bool = False, soup=None) → set
Extract destination (name, Wikipedia URL) pairs from an airport page.
- Parameters:
identifier (str, optional) – IATA/ICAO code, Wikipedia URL, or name. Default: "YWG".
link (str or None, optional) – Wikipedia page URL (fetched automatically if absent).
html_content (str or None, optional) – Pre-fetched HTML (fetched automatically if absent).
verbose (bool, optional) – Print progress. Default: False.
soup (BeautifulSoup or None, optional) – Pre-parsed BeautifulSoup object. Default: None.
- Returns:
(destination_name, wikipedia_url) pairs.
- Return type:
set of tuple[str, str]
- wikipediaGATN.airport_level_functions.format_airport_json(data: dict) → dict
Enforce a strict ordering of JSON keys for airport data output.
- wikipediaGATN.airport_level_functions.parse_fallback_nlp_airlines_destinations(html_content: str, verbose: bool = False, soup=None) → set
Use spaCy NER to extract (airline, destination) pairs as a last resort.
- Parameters:
html_content (str) – Parsed HTML of the airport Wikipedia page.
verbose (bool, optional) – Print match counts. Default: False.
soup (BeautifulSoup or None, optional) – Pre-parsed BeautifulSoup object. Default: None.
- Returns:
(ORG entity, GPE entity) pairs.
- Return type:
set of tuple[str, str]
Notes
Requires en_core_web_sm. Install with:
python -m spacy download en_core_web_sm
Returns an empty set if the model is not available.
- wikipediaGATN.airport_level_functions.parse_infobox_from_wikitext(wikitext: str, verbose: bool = False) → dict
Parse the {{Infobox airport}} template from wikitext into a dict.
- Parameters:
wikitext (str) – Full wikitext source of the airport Wikipedia page.
verbose (bool, optional) – Print parsed key list. Default: False.
- Returns:
{field_name: cleaned_value, ...} plus derived lat, lon, region, and ISO 3166-2 country fields when available. Returns {} if no infobox is found.
- Return type:
dict
- wikipediaGATN.airport_level_functions.parse_iso3166_2(region_code: str)
Resolve an ISO 3166-2 region code to country and subdivision details.
- Parameters:
region_code (str) – Code in the form "CC-SUB" (e.g. "CA-MB").
- Returns:
{'country_alpha3': str, 'country_name': str, 'subdivision_code': str} or None if region_code is invalid or the country is not found.
- Return type:
dict or None
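The shape of the return value can be illustrated with a tiny lookup table. Everything here is a sketch under stated assumptions: parse_region_code and the two-entry COUNTRIES table are hypothetical, the real function presumably consults a full country database, and it is assumed (not confirmed by the reference) that subdivision_code holds only the part after the hyphen.

```python
# Hypothetical two-entry table; a real lookup would cover all countries.
COUNTRIES = {
    "CA": ("CAN", "Canada"),
    "US": ("USA", "United States"),
}


def parse_region_code(region_code):
    """Hypothetical helper: split "CC-SUB" and look up the country part."""
    if not isinstance(region_code, str) or "-" not in region_code:
        return None  # not in CC-SUB form
    alpha2, _, subdivision = region_code.strip().upper().partition("-")
    if len(alpha2) != 2 or not subdivision or alpha2 not in COUNTRIES:
        return None  # malformed code or unknown country
    alpha3, name = COUNTRIES[alpha2]
    return {
        "country_alpha3": alpha3,
        "country_name": name,
        # Assumption: only the subdivision part is stored here.
        "subdivision_code": subdivision,
    }


manitoba = parse_region_code("CA-MB")
unknown_country = parse_region_code("XX-YY")
malformed = parse_region_code("CAMB")
```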
- wikipediaGATN.airport_level_functions.parse_lat_lon_from_string(coord_string: str)
Parse a coordinate string into decimal-degree latitude and longitude.
- Parameters:
coord_string (str) – Any format accepted by geopy (DMS, decimal, etc.).
- Returns:
(latitude, longitude) as 6-decimal-place strings, or ("", "") on parse failure.
- Return type:
tuple[str, str]
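The documented function relies on geopy for parsing. As a standard-library stand-in, the sketch below converts only degrees/minutes/seconds strings with hemisphere letters, to show the 6-decimal string output and the ("", "") failure value. The name dms_to_decimal_strings and the regular expression are assumptions of this example and cover far fewer formats than geopy does.

```python
import re

# Degrees, minutes, seconds, then a hemisphere letter; separators are free-form.
DMS = re.compile(r"(\d+)\D+(\d+)\D+(\d+(?:\.\d+)?)\D*?([NSEW])")


def dms_to_decimal_strings(coord_string):
    """Hypothetical helper: DMS text -> (lat, lon) as 6-decimal strings."""
    parts = DMS.findall(coord_string)
    if len(parts) != 2:
        return ("", "")  # same failure value as the documented function
    values = {}
    for degrees, minutes, seconds, hemisphere in parts:
        value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
        if hemisphere in "SW":
            value = -value  # south and west are negative
        values["lat" if hemisphere in "NS" else "lon"] = value
    if set(values) != {"lat", "lon"}:
        return ("", "")  # e.g. two latitudes and no longitude
    return (f"{values['lat']:.6f}", f"{values['lon']:.6f}")


parsed = dms_to_decimal_strings("49 54 36 N 97 14 24 W")
failed = dms_to_decimal_strings("not a coordinate")
```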
- wikipediaGATN.airport_level_functions.parse_wikitext_airlines_destinations(wikitext: str) → dict
Extract airline to destinations data from {{Airport-dest-list}} templates.
Only destinations expressed as Wikipedia wikilinks are included.
- Parameters:
wikitext (str) – Full wikitext of the airport Wikipedia page.
- Returns:
{"passenger": {airline_name: [{"name": str, "wikipedia_url": str}, ...], ...}, "cargo": {...}}
- Return type:
dict
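To show why only wikilinked destinations survive, the sketch below pulls [[target]] and [[target|label]] links out of a fragment and turns each into a {"name", "wikipedia_url"} entry. It does not parse the {{Airport-dest-list}} template itself. The helper wikilinks_to_destinations is hypothetical, and two details are assumptions: the display label is used as the name when present, and URLs are built on en.wikipedia.org.

```python
import re
from urllib.parse import quote

# Matches [[Target]] and [[Target|Label]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def wikilinks_to_destinations(wikitext_fragment):
    """Hypothetical helper: list one entry per wikilink in the fragment."""
    destinations = []
    for target, label in WIKILINK.findall(wikitext_fragment):
        title = target.strip()
        destinations.append(
            {
                # Assumption: prefer the display label when one is given.
                "name": (label or title).strip(),
                "wikipedia_url": "https://en.wikipedia.org/wiki/"
                + quote(title.replace(" ", "_")),
            }
        )
    return destinations


fragment = (
    "[[Toronto Pearson International Airport|Toronto-Pearson]], "
    "[[Vancouver International Airport]], Churchill"
)
entries = wikilinks_to_destinations(fragment)
```

The plain-text destination "Churchill" is dropped because it is not a wikilink.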
- wikipediaGATN.airport_level_functions.save_airport_info(airport_info: dict, level: int = 0, verbose: bool = False, save_progress: bool = True, iata_from: str = '') → str
Persist an airport info dictionary to TEMP_RESULTS_DIR/<CODE>.<level>.json.
- Parameters:
airport_info (dict) – Dict as returned by fetch_wikipedia_airport_info().
level (int, optional) – BFS distance level from the seed airport. Default: 0.
verbose (bool, optional) – Print the saved path. Default: False.
save_progress (bool, optional) – Append to processed_locations.csv. Default: True.
- Returns:
The IATA code (or wiki_<title> / "unknown") used as the filename prefix.
- Return type:
str
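The filename convention can be sketched as below. The helpers filename_prefix and save_sketch are hypothetical, the fallback order (IATA code, then wiki_<title> derived from wikipedia_url, then "unknown") is an assumption based on the return description, and the progress CSV is left out.

```python
import json
import tempfile
from pathlib import Path
from urllib.parse import unquote, urlparse


def filename_prefix(airport_info):
    """Hypothetical helper: choose the <CODE> part of the filename."""
    iata = airport_info.get("iata")
    if iata and iata != "None":
        return iata
    url = airport_info.get("wikipedia_url")
    if url:
        # Assumption: the title is the last path segment of the URL.
        title = unquote(urlparse(url).path.rsplit("/", 1)[-1])
        if title:
            return f"wiki_{title}"
    return "unknown"


def save_sketch(airport_info, directory, level=0):
    """Write <prefix>.<level>.json into directory and return the prefix."""
    prefix = filename_prefix(airport_info)
    path = Path(directory) / f"{prefix}.{level}.json"
    path.write_text(json.dumps(airport_info, indent=2), encoding="utf-8")
    return prefix


with tempfile.TemporaryDirectory() as tmp:
    with_code = save_sketch({"iata": "YWG"}, tmp, level=0)
    without_code = save_sketch(
        {"iata": None, "wikipedia_url": "https://en.wikipedia.org/wiki/Example_Airfield"},
        tmp,
        level=1,
    )
    nothing = save_sketch({}, tmp, level=2)
    written = sorted(p.name for p in Path(tmp).iterdir())
```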