| 1 | +# MWM size prediction model |
| 2 | + |
| 3 | +The web application uses a data-science-based model to predict the size of MWM files |
| 4 | +compiled from the borders. This document describes the work that went into |
| 5 | +building that prediction model. The serialized model resides in `web/app/data/` |
| 6 | +in the `model.pkl` and `scaler.pkl` files. Its first variant was trained only |
| 7 | +on county-level data and is valid only within a limited parameter range (see `web/app/config.py` |
| 8 | +for the model limitations). We are now trying to extend the model to also cover |
| 9 | +province-level regions. |
| 10 | + |
| 11 | +## Data gathering |
| 12 | + |
| 13 | +We chose countries/regions with dense OSM data and used them as the training |
| 14 | +dataset. As a first try, Germany, Austria, Belgium and the Netherlands were taken, |
| 15 | +giving about 950 borders of different admin levels. The sample was found to |
| 16 | +be too small for good training. |
| 17 | +Then Norway, Switzerland, Ile-de-France (France), Japan, the United Kingdom, Belarus |
| 18 | +and four states of the United States (California, Texas, New York, Washington) |
| 19 | +were added. |
| 20 | + |
| 21 | +#### Geographic data gathering |
| 22 | + |
| 23 | +First, with the help of the web app, I split the aforementioned countries/regions down to |
| 24 | +"county" subregions. "County" is meant in a general sense here: an admin level |
| 25 | +that is too small for an MWM, while regions one level higher are too big, |
| 26 | +so that a typical MWM would be a cluster of "counties". |
| 27 | + |
| 28 | +Japan was a special case, so the `extract_mwm_geo_data.py` script contains a |
| 29 | +function to split the country into subregions. |
| 30 | + |
| 31 | +Given a valid connection to the borders database, the `extract_mwm_geo_data.py` |
| 32 | +script gathers information about all borders of a |
| 33 | +given country/region and its descendants: id, parent_id, admin level, name, |
| 34 | +full area, land area (which requires the table with the planet's land borders), |
| 35 | +city/town count and population, hamlet/village count and population. |
| 36 | + |
| 37 | +Keep in mind that some borders may be absent from OSM, so a region may |
| 38 | +not be fully covered by its subregions. As a result, a region's area (or place count, |
| 39 | +or population) may be greater than the sum over its subregions. |
| 40 | +One remedy is to fix the borders by hand. The other, which I followed, is to select |
| 41 | +areas, cities and population from the database directly even for upper-level regions |
| 42 | +(except countries, for which the calculation would run too long and is not useful). |
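The coverage gap described above can be spotted with a few lines of pandas. The following is only a minimal sketch and is not part of `extract_mwm_geo_data.py`: it assumes a table with the `id`, `parent_id` and `full_area` columns listed earlier, and the helper name `coverage_gaps`, the `tolerance` parameter and the toy numbers are made up for illustration.

```python
import pandas as pd


def coverage_gaps(df, field="full_area", tolerance=0.01):
    """Return regions whose direct children sum to noticeably less than the region itself.

    Hypothetical helper: `tolerance` is the relative shortfall still treated
    as full coverage.
    """
    # Sum the field over the direct children of every parent (rows without a parent are skipped).
    child_sum = df.dropna(subset=["parent_id"]).groupby("parent_id")[field].sum()
    child_sum.index = child_sum.index.astype("int64")

    # Attach the children's sum to each region; leaf regions get NaN.
    out = df.assign(children_sum=df["id"].map(child_sum))
    out = out.assign(shortfall=1 - out["children_sum"] / out[field])

    # NaN shortfalls (leaf regions) compare as False and drop out here.
    return out.loc[out["shortfall"] > tolerance, ["id", field, "children_sum", "shortfall"]]


# Toy data (made up): region 1 is fully covered, region 2 is missing a subregion.
regions = pd.DataFrame({
    "id": [1, 2, 11, 12, 21],
    "parent_id": [None, None, 1, 1, 2],
    "full_area": [100.0, 80.0, 60.0, 40.0, 50.0],
})
gaps = coverage_gaps(regions)
print(gaps)
```

The same check could be run with `field` set to a place count or population column, which is where the gaps matter for the model features.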
| 43 | + |
| 44 | +#### MWM size data gathering |
| 45 | + |
| 46 | +With the training countries divided into borders in the web app, I downloaded all |
| 47 | +borders, changing the poly-file naming procedure so that each name contains |
| 48 | +the region id. The id links the files with geodata to the MWM size data. |
| 49 | +This gives many border files with names like _03565917_Japan_Gunma Prefecture_Numata.poly_, |
| 50 | +which I placed into the `omim/data/borders/` directory in place of the original |
| 51 | +borders. |
| 52 | + |
| 53 | +Also, I made an `*.o5m` extract for each country so that maps_generator does not |
| 54 | +have to process the whole `planet-latest.o5m` file. I used https://boundingbox.klokantech.com |
| 55 | +to find a polygon for the extract, first getting a GeoJSON of ten points or so from the |
| 56 | +website and then composing a `*.poly` file in a text editor. With this |
| 57 | +`country.poly` file I produced a country extract using the `osmconvert` tool: |
| 58 | +```bash |
| 59 | +osmctools/osmconvert planet-latest.o5m -B=country.poly -o=country.o5m |
| 60 | +``` |
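Composing the `*.poly` file by hand can also be scripted. The sketch below is an illustration rather than the procedure actually used: it assumes the copied text is a GeoJSON Polygon geometry (or a Feature wrapping one) and writes the common polygon filter layout, which is a name line, a section line, one `lon lat` pair per line, and two closing `END` lines. The function name `geojson_ring_to_poly` and the sample coordinates are hypothetical.

```python
import json


def geojson_ring_to_poly(geojson_text, name="country"):
    """Convert the outer ring of a GeoJSON Polygon into .poly file text."""
    geometry = json.loads(geojson_text)
    if geometry.get("type") == "Feature":
        geometry = geometry["geometry"]

    ring = geometry["coordinates"][0]  # outer ring: list of [lon, lat, ...]

    lines = [name, "1"]  # polygon name, then the name of its only section
    for point in ring:
        lon, lat = point[:2]  # ignore an optional altitude value
        lines.append(f"    {lon:.6f}    {lat:.6f}")
    lines += ["END", "END"]  # first END closes the section, second closes the file
    return "\n".join(lines) + "\n"


# Rough bounding polygon, made up for the example.
sample = (
    '{"type": "Polygon", "coordinates": '
    "[[[5.9, 45.8], [10.5, 45.8], [10.5, 47.8], [5.9, 47.8], [5.9, 45.8]]]}"
)
poly_text = geojson_ring_to_poly(sample, "switzerland")
print(poly_text)
```

The returned text can be saved as `country.poly` and passed to the `-B=` option shown above.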
| 61 | + |
| 62 | +Then I generated the md5 file for the extract: |
| 63 | +```bash |
| 64 | +md5sum country.o5m > country.o5m.md5 |
| 65 | +``` |
| 66 | +In `maps_generation.ini` I changed the paths to the planet and md5sum files and ran |
| 67 | +the MWM generation with |
| 68 | +```bash |
| 69 | +nohup python -m maps_generator --order="" --skip="Routing,RoutingTransit" \ |
| 70 | + --without_countries="World*" --countries="*_Switzerland_*" & |
| 71 | +``` |
| 72 | + |
| 73 | +For the asterisk to work at the beginning of the mask in the `--countries` option, |
| 74 | +I made some changes to `omim/tools/generator/maps_generator/__main__.py`: |
| 75 | + |
| 76 | +```python |
| 77 | +    def end_star_compare(prefix, full): |
| 78 | +        return full.startswith(prefix) |
| 79 | + |
| 80 | +    def start_star_compare(suffix, full): |
| 81 | +        return full.endswith(suffix) |
| 82 | + |
| 83 | +    def both_star_compare(substr, full): |
| 84 | +        return substr in full |
| 85 | + |
| 86 | +    ... |
| 87 | +    cmp = compare |
| 88 | +    _raw_country = country_item[:] |
| 89 | +    if _raw_country: |
| 90 | +        if all(_raw_country[i] == "*" for i in (0, -1)): |
| 91 | +            _raw_country = _raw_country.replace("*", "") |
| 92 | +            cmp = both_star_compare |
| 93 | +        elif _raw_country[-1] == "*": |
| 94 | +            _raw_country = _raw_country.replace("*", "") |
| 95 | +            cmp = end_star_compare |
| 96 | +        elif _raw_country[0] == "*": |
| 97 | +            _raw_country = _raw_country.replace("*", "") |
| 98 | +            cmp = start_star_compare |
| 99 | + |
| 100 | +``` |
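For comparison, the same three wildcard cases (leading, trailing, or both) can be expressed with Python's standard `fnmatch` module. This is an alternative sketch, not the change that was made to `__main__.py`; the function name `select_countries` and the sample names are illustrative only.

```python
from fnmatch import fnmatchcase


def select_countries(mask, names):
    """Return the names that match a shell-style mask such as "*_Switzerland_*"."""
    # fnmatchcase is case-sensitive on every platform, unlike fnmatch.fnmatch.
    return [name for name in names if fnmatchcase(name, mask)]


# Made-up MWM names following the <id>_<Country>_<Region> pattern.
names = [
    "01_Switzerland_Bern",
    "02_Switzerland_Zurich",
    "03_Japan_Gunma Prefecture_Numata",
    "World",
]

both = select_countries("*_Switzerland_*", names)
leading = select_countries("*Numata", names)
trailing = select_countries("World*", names)
print(both, leading, trailing)
```

One difference to be aware of: `fnmatch` also treats `?` and `[...]` as special characters, so a mask containing them would behave differently from the plain substring comparison above.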
| 101 | + |
| 102 | +After all MWMs for a country had been generated in a directory like |
| 103 | +`maps_build/2021_01_20__18_06_38/210120`, |
| 104 | +I got their sizes (in kilobytes) with this command: |
| 105 | + |
| 106 | +```bash |
| 107 | +du maps_build/2021_01_20__18_06_38/210120/*.mwm | sort -k2 > Norway.sizes |
| 108 | +``` |
| 109 | + |
| 110 | +In fact, I renamed the directory to something like `2021_01_20__18_06_38-Norway` and used the command |
| 111 | +```bash |
| 112 | +du maps_build/*-Norway/[0-9]*/*.mwm | sort -k2 > Norway.sizes |
| 113 | +``` |
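A `.sizes` file produced this way holds one `du` line per MWM: the size, a tab, and the path. The sketch below shows one way such a file could be turned into an id-to-size mapping. It is not the code of `combine_data.py`; it assumes the numeric prefix of the file name is the region id, and the function name `parse_sizes` and the sample lines are made up.

```python
import os


def parse_sizes(lines):
    """Map the numeric id prefix of each MWM file name to its size as reported by `du`."""
    sizes = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # du separates size and path with a tab, so spaces in the path are safe.
        size, path = line.split("\t", 1)
        stem = os.path.splitext(os.path.basename(path))[0]
        region_id = int(stem.split("_", 1)[0])  # "03565917_Japan_..." -> 3565917
        sizes[region_id] = int(size)
    return sizes


# Two made-up lines in the format written by the du command above.
sample_lines = [
    "1234\tmaps_build/2021_01_20__18_06_38-Japan/210120/03565917_Japan_Gunma Prefecture_Numata.mwm\n",
    "5678\tmaps_build/2021_01_20__18_06_38-Japan/210120/03565918_Japan_Gunma Prefecture_Minakami.mwm\n",
]
sizes = parse_sizes(sample_lines)
print(sizes)
```

With a real file, `parse_sizes(open("Norway.sizes", encoding="utf-8"))` would do the same. Whether the id needs further normalization, such as restoring a sign, depends on how ids are stored in the geodata files.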
| 114 | + |
| 115 | +#### Combining geo data with sizes data |
| 116 | + |
| 117 | +Now I had a set of `<Country>_regions.json` and `<Country>.sizes` files |
| 118 | +with geodata and size data respectively for several large regions and their subregions. |
| 119 | +I used the `combine_data.py` script to generate one big `7countries.csv`. |
| 120 | + |
| 121 | +Another file, `4countries.csv`, with the subregions of Germany, Austria, Belgium and the |
| 122 | +Netherlands had been prepared earlier. It has the excluded=1 flag set for those |
| 123 | +Netherlands subregions that contain a lot of water (inland waters, not ocean). Also, |
| 124 | +there was no data for upper-level regions, so the values of area, cities, |
| 125 | +population and mwm_size were obtained by summing over the subregions identified by the |
| 126 | +parent_id column. |
| 127 | + |
| 128 | +Since data for country-level regions was not collected (due to long SQL queries and |
| 129 | +MWM generation time), we enrich the `7countries.csv` dataset with country-level rows |
| 130 | +by summing up the data of their subregions: |
| 131 | +```python |
| 132 | +import pandas as pd |
| 133 | +data7 = pd.read_csv('data/7countries.csv', sep=';') |
| 134 | + |
| 135 | +# Drop country-level rows if present |
| 136 | +data7 = data7[data7['al'] != 2] |
| 137 | + |
| 138 | +countries = {'id': [-59065, -2978650, -51701, -382313, -62149], |
| 139 | + 'name': ['Belarus', 'Norway', 'Switzerland', 'Japan', 'United Kingdom'], |
| 140 | + 'excluded': [0]*5, |
| 141 | + 'al': [2]*5, |
| 142 | + } |
| 143 | +sum_fields = ('full_area', 'land_area', 'city_cnt', 'hamlet_cnt', 'city_pop', 'hamlet_pop', 'mwm_size_sum') |
| 144 | + |
| 145 | +for field in sum_fields: |
| 146 | + field_values = [data7[data7['parent_id'] == c_id][field].sum() for c_id in countries['id']] |
| 147 | + countries[field] = field_values |
| 148 | + |
| 149 | +countries_df = pd.DataFrame(countries, columns = list(countries.keys())) |
| 150 | +data7 = pd.concat([data7, countries_df]) |
| 151 | +data7.to_csv('data/7countries-1.csv', index=False, sep=';') |
| 152 | + |
| 153 | +# Check, and if all right, do |
| 154 | +# import os; os.rename('data/7countries-1.csv', 'data/7countries.csv') |
| 155 | +``` |
| 156 | + |
| 157 | +The union of the `4countries.csv` and `7countries.csv` data is the |
| 158 | +dataset for data science experiments on MWM size prediction. Keep in mind |
| 159 | +that the _mwm_size_ field may be NULL (for countries), or _mwm_size_sum_ may be NULL |
| 160 | +(in `4countries.csv`). Correct for this when building the combined dataset: |
| 161 | + |
| 162 | +```python |
| 163 | +import pandas as pd |
| 164 | +import numpy as np |
| 165 | + |
| 166 | +def fit_mwm_size(df):  # fills missing mwm_size from mwm_size_sum, in place |
| 167 | + df['mwm_size'] = np.where(df['mwm_size'].isnull(), df['mwm_size_sum'], df['mwm_size']) |
| 168 | + |
| 169 | +data1 = pd.read_csv('data/4countries.csv', sep=';') # Austria, Belgium, Netherlands, Germany |
| 170 | +data2 = pd.read_csv('data/7countries.csv', sep=';') # Norway, UK, US(4 states), Switzerland, Japan, Belarus, Ile-de-France |
| 171 | + |
| 172 | +data = pd.concat([data1, data2]) |
| 173 | + |
| 174 | +data = data[data.excluded.eq(0) & data.id.notnull()].copy()  # copy, since fit_mwm_size() assigns a column |
| 175 | + |
| 176 | +fit_mwm_size(data) |
| 177 | +``` |