Commit 091cddc

MWM size prediction research: scripts, descriptions and data
1 parent 360d52b commit 091cddc

20 files changed

Lines changed: 34297 additions & 0 deletions

mwm_size_prediction/README.md

Lines changed: 177 additions & 0 deletions
@@ -0,0 +1,177 @@
# MWM size prediction model

The web application uses a data-science-based model to predict the size of MWM files
compiled from the borders. This document describes the efforts undertaken
to build such a prediction model. The serialized model resides in `web/app/data/`
in the `model.pkl` and `scaler.pkl` files. Its first variant was trained only
on county-level data and is valid within a limited range of parameters (see `web/app/config.py`
for the model limitations). Now we try to extend the model to also predict
province-level regions.

## Data gathering

We chose countries/regions with dense OSM data and took them as the training
dataset. As a first try, Germany, Austria, Belgium and the Netherlands were taken,
giving about 950 borders of different admin levels. The sample was found to
be too small for good training.
Then Norway, Switzerland, Ile-de-France (France), Japan, the United Kingdom, Belarus
and four states of the United States (California, Texas, New York, Washington)
were added.

#### Geographic data gathering

First, with the help of the web app, I split the aforementioned countries/regions down to
"county" subregions in the general sense of a "county": an admin level
which is too small for MWMs, while regions one level higher are too big,
so that a usual MWM would be a cluster of "counties".

Japan was a special case, so the `extract_mwm_geo_data.py` script contains a
function to split the country into subregions.

The `extract_mwm_geo_data.py` script, given a valid connection
to the database with borders, gathers information about all borders of a
given country/region and its descendants: id, parent_id, admin level, name,
full area, land area (so the table with land borders of the planet is necessary),
city/town count and population, hamlet/village count and population.

One should keep in mind that some borders may be absent in OSM, so a region may
not be fully covered by its subregions. As a result, a region's area (or place count,
or population) may be greater than the sum over its subregions.
One way to deal with this is to fix borders by hand. Another way, which I followed, is to select
areas, cities and population from the database even for upper-level regions
(except countries, for which the calculation would take too long and is not useful).

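The coverage gap described above can be measured before deciding how to handle a region. Below is a minimal sketch, assuming region records shaped like those read by `combine_data.py` (a dict keyed by region id, each record having `parent_id` and `land_area` fields). The function name `coverage_ratio` and the toy numbers are illustrative only and are not part of the research scripts.

```python
def coverage_ratio(regions, parent_id, field='land_area'):
    """Return the sum of `field` over direct children divided by the parent's value."""
    parent_value = regions[parent_id][field]
    children_sum = sum(
        r[field] for r in regions.values() if r['parent_id'] == parent_id
    )
    return children_sum / parent_value


# Toy data (made-up values); ids are negative, as in the datasets of this research.
regions = {
    -1: {'parent_id': None, 'land_area': 100.0},
    -2: {'parent_id': -1, 'land_area': 40.0},
    -3: {'parent_id': -1, 'land_area': 50.0},
}

ratio = coverage_ratio(regions, -1)
print(f'{ratio:.0%} of the parent land area is covered by subregions')
```

A ratio noticeably below 1 would suggest that some subregion borders are missing in OSM for that parent.
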
#### Mwm size data gathering

Having the border division of the training countries in the web app, I downloaded all
borders, changing the poly-file naming procedure so that the name contains
the region id. The id is the link between the files with geodata and the mwm size data.
So we have many border files with names like `03565917_Japan_Gunma Prefecture_Numata.poly`
that I placed into the `omim/data/borders/` directory instead of the original
borders.

Also, I made an `*.o5m` extract for each country so as not to feed the maps_generator
the whole `planet-latest.o5m` file. I used https://boundingbox.klokantech.com
to find a polygon for an extract, first getting a GeoJSON of ten points or so from the
website and then composing a `*.poly` file in a text editor. With this
`country.poly` file I got a country extract with the `osmconvert` tool:
```bash
osmctools/osmconvert planet-latest.o5m -B=country.poly -o=country.o5m
```

Then I computed the checksum of the extract:
```bash
md5sum country.o5m > country.o5m.md5
```
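
The `country.poly` file was composed by hand, but the same step can be scripted. The sketch below assumes the commonly used Osmosis-style polygon filter layout (a name line, a section name, one `lon lat` pair per line, then `END` twice) and a single outer ring. The helper name `make_poly` and the coordinates are illustrative and are not taken from the actual extracts. GeoJSON stores points as `[lon, lat]`, which is the same order the polygon file uses.

```python
def make_poly(name, ring):
    """Format one ring of (lon, lat) pairs as the text of a *.poly file."""
    points = list(ring)
    if points[0] != points[-1]:
        points.append(points[0])  # repeat the first point to close the ring
    lines = [name, '1']  # polygon file name line, then the section name
    lines += [f'    {lon:.6f}    {lat:.6f}' for lon, lat in points]
    lines += ['END', 'END']  # end of the section, then end of the file
    return '\n'.join(lines) + '\n'


# Illustrative coordinates only, not an actual extract polygon.
ring = [(5.9, 45.8), (10.5, 45.8), (10.5, 47.9), (5.9, 47.9)]
poly_text = make_poly('country', ring)
print(poly_text)
```
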
In `maps_generation.ini` I changed the paths to the planet and md5sum files and ran
the MWM generation with
```bash
nohup python -m maps_generator --order="" --skip="Routing,RoutingTransit" \
    --without_countries="World*" --countries="*_Switzerland_*" &
```

For the asterisk to work at the beginning of the mask in the `--countries` option,
I made some changes to `omim/tools/generator/maps_generator/__main__.py`:

```python
def end_star_compare(prefix, full):
    return full.startswith(prefix)

def start_star_compare(suffix, full):
    return full.endswith(suffix)

def both_star_compare(substr, full):
    return substr in full

...
cmp = compare
_raw_country = country_item[:]
if _raw_country:
    if all(_raw_country[i] == "*" for i in (0, -1)):
        _raw_country = _raw_country.replace("*", "")
        cmp = both_star_compare
    elif _raw_country[-1] == "*":
        _raw_country = _raw_country.replace("*", "")
        cmp = end_star_compare
    elif _raw_country[0] == "*":
        _raw_country = _raw_country.replace("*", "")
        cmp = start_star_compare
```
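
To try the mask logic outside of `maps_generator`, here is a self-contained sketch of the same idea. It makes two assumptions: a mask without asterisks is compared by plain equality (the original `compare` function is not shown in the excerpt), and asterisks are stripped only from the ends of the mask, whereas the excerpt removes every asterisk with `replace`. The names `exact_compare`, `pick_comparator` and `matches` are hypothetical.

```python
def exact_compare(pattern, full):
    # Assumption: stands in for the `compare` function not shown in the excerpt.
    return pattern == full

def end_star_compare(prefix, full):
    return full.startswith(prefix)

def start_star_compare(suffix, full):
    return full.endswith(suffix)

def both_star_compare(substr, full):
    return substr in full

def pick_comparator(mask):
    """Return (mask without leading/trailing asterisks, comparison function)."""
    if mask.startswith('*') and mask.endswith('*'):
        return mask.strip('*'), both_star_compare
    if mask.endswith('*'):
        return mask.rstrip('*'), end_star_compare
    if mask.startswith('*'):
        return mask.lstrip('*'), start_star_compare
    return mask, exact_compare

def matches(mask, name):
    pattern, cmp = pick_comparator(mask)
    return cmp(pattern, name)

print(matches('*_Switzerland_*', '0051701_Switzerland_Zurich'))
print(matches('World*', 'WorldCoasts'))
```

Note that a mask consisting of a single `*` reduces to an empty substring and therefore matches every name.
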

After all mwms for a country had been generated in a directory like
`maps_build/2021_01_20__18_06_38/210120`,
I got their sizes (in KB) with this command:

```bash
du maps_build/2021_01_20__18_06_38/210120/*.mwm | sort -k2 > Norway.sizes
```

In fact, I renamed the directory to something like `2021_01_20__18_06_38-Norway` and used the command
```bash
du maps_build/*-Norway/[0-9]*/*.mwm | sort -k2 > Norway.sizes
```
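
Each line of a `.sizes` file therefore holds a size and a path whose file name starts with the region id. The sketch below shows how such a line can be split into its parts, following the conventions used in `combine_data.py` (the id in the file name is positive, while ids in the datasets are negative). The helper name `parse_sizes_line` and the size value in the sample line are made up for illustration.

```python
def parse_sizes_line(line):
    """Split one `du` output line into (region_id, mwm_name, size)."""
    # `du` prints "<size><TAB><path>"; region names may contain spaces,
    # so split only once, on the first run of whitespace.
    size_str, path = line.rstrip('\n').split(maxsplit=1)
    file_name = path.split('/')[-1]
    if file_name.endswith('.mwm'):
        mwm_name = file_name[:-len('.mwm')]
    else:
        mwm_name = file_name
    region_id = -int(mwm_name.split('_')[0])
    return region_id, mwm_name, int(size_str)


# Sample line with a made-up size.
sample = ('1234\tmaps_build/2021_01_20__18_06_38-Japan/210120/'
          '03565917_Japan_Gunma Prefecture_Numata.mwm\n')
region_id, mwm_name, size = parse_sizes_line(sample)
print(region_id, mwm_name, size)
```
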

#### Combining geo data with sizes data

Now I had a set of `<Country>_regions.json` and `<Country>.sizes` files
with geo data and size data, respectively, on several large regions with their subregions.
I used the `combine_data.py` script to generate one big `7countries.csv`.

Another file, `4countries.csv`, with Germany, Austria, Belgium and Netherlands
subregions had been prepared earlier. It has the excluded=1 flag for those
Netherlands subregions which contain much water (inland waters, not ocean). Also,
there was no data for upper-level regions, so their values of area, cities,
population and mwm_size were obtained as the sum over the subregions defined by
the parent_id column.

Since data for country-level regions was not collected (due to long SQL queries and
mwm generation time), we enrich the `7countries.csv` dataset with country-level
rows by summing up the data of subregions:
```python
import pandas as pd
data7 = pd.read_csv('data/7countries.csv', sep=';')

# Drop country-level rows if present
data7 = data7[data7['al'] != 2]

countries = {'id': [-59065, -2978650, -51701, -382313, -62149],
             'name': ['Belarus', 'Norway', 'Switzerland', 'Japan', 'United Kingdom'],
             'excluded': [0]*5,
             'al': [2]*5,
            }
sum_fields = ('full_area', 'land_area', 'city_cnt', 'hamlet_cnt', 'city_pop', 'hamlet_pop', 'mwm_size_sum')

# For each country, sum every field over its direct subregions
for field in sum_fields:
    field_values = [data7[data7['parent_id'] == c_id][field].sum() for c_id in countries['id']]
    countries[field] = field_values

countries_df = pd.DataFrame(countries, columns = list(countries.keys()))
data7 = pd.concat([data7, countries_df])
data7.to_csv('data/7countries-1.csv', index=False, sep=';')

# Check, and if all right, do
# import os; os.rename('data/7countries-1.csv', 'data/7countries.csv')
```

The union of the `4countries.csv` and `7countries.csv` data is the
dataset for data science experiments on mwm size prediction. Keep in mind
that the _mwm_size_ field may be NULL (for countries), and _mwm_size_sum_ may be NULL
(in `4countries.csv`). Correct for this when building the combined dataset:

```python
import pandas as pd
import numpy as np

def fit_mwm_size(df):
    # Where a region has no mwm size of its own, fall back to the sum over its subregions
    df['mwm_size'] = np.where(df['mwm_size'].isnull(), df['mwm_size_sum'], df['mwm_size'])

data1 = pd.read_csv('data/4countries.csv', sep=';')  # Austria, Belgium, Netherlands, Germany
data2 = pd.read_csv('data/7countries.csv', sep=';')  # Norway, UK, US(4 states), Switzerland, Japan, Belarus, Ile-de-France

data = pd.concat([data1, data2])

# .copy() makes an independent frame, so fit_mwm_size() does not assign into a slice
data = data[data.excluded.eq(0) & data.id.notnull()].copy()

fit_mwm_size(data)
```
Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
import json
import csv


def get_combined_info(region_name):
    # Geo data keyed by region id; JSON keys are strings, so convert them to int.
    with open(f'data/{region_name}_regions.json', newline='') as f:
        regions = json.load(f)
        regions = {int(k):v for k, v in regions.items()}

    # Each line of the sizes file is `du` output: "<size> <path>/<id>_..._<name>.mwm"
    with open(f'data/{region_name}.sizes') as sizes_file:
        for line in sizes_file:
            # Strip the newline first; otherwise [:-4] would cut "mwm\n" and leave a trailing dot
            mwm_name = line.rstrip('\n').split('/')[-1][:-4]
            # Ids in file names are positive, ids in the geo data are negative
            r_id = -int(mwm_name.split('_')[0])
            if r_id not in regions:
                raise Exception(f'id {r_id} not in {region_name} data')
            size = int(line.split()[0])
            country = mwm_name.split('_')[1]

            regions[r_id].update({
                'mwm_name': mwm_name,
                'country': country,
                'mwm_size': size,
            })


    admin_levels = set(x['al'] for x in regions.values())

    ids_to_remove = []  # Far overseas regions may be listed although no mwm was generated for them
    # Walk from the deepest admin level upwards
    for al in sorted(admin_levels, reverse=True):
        for r_id, r_data in ((r_id, r_data) for r_id, r_data in regions.items() if r_data['al'] == al):
            children = [ch for ch in regions.values() if ch['parent_id'] == r_id]
            is_leaf = not bool(children)
            r_data['is_leaf'] = int(is_leaf)
            r_data['excluded'] = 0
            if is_leaf:
                if 'mwm_size' not in r_data:
                    print(f"Mwm not generated for {r_data['name']}")
                    ids_to_remove.append(r_id)
                else:
                    r_data['mwm_size_sum'] = r_data['mwm_size']
            else:
                r_data['mwm_size_sum'] = sum(ch['mwm_size'] for ch in children)

    return {k:v for k,v in regions.items() if k not in ids_to_remove}


def main():
    region_names = [
        'Belarus', 'Switzerland', 'Ile-de-France',
        'United Kingdom', 'Norway', 'Japan', 'United States'
    ]

    # full_area includes ocean.
    fieldnames = ['id', 'parent_id', 'al', 'is_leaf', 'excluded', 'name', 'mwm_name', 'country',
        'city_cnt', 'city_pop', 'hamlet_cnt', 'hamlet_pop',
        'full_area', 'land_area', 'mwm_size', 'mwm_size_sum']

    with open('data/7countries.csv', 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, delimiter=';', fieldnames=fieldnames)
        writer.writeheader()

        for region_name in region_names:
            regions = get_combined_info(region_name)
            # Within each region: by admin level, then by name
            rows = sorted(regions.values(), key=lambda reg: (reg['al'], reg['name']))
            writer.writerows(rows)


if __name__ == '__main__':
    main()
