sdk-api + parse
Turn MSDN's Windows SDK API into a .json file (and do so fast!)
The generated JSON for CreateFileW looks like this:
{
  "CreateFileW": {
    "header": "fileapi.h",
    "lib": "Kernel32.lib",
    "dll": "Kernel32.dll",
    "min_client_version": "Windows XP [desktop apps only]",
    "min_server_version": "Windows Server 2003 [desktop apps only]",
    "metadata": {
      "UID": "NF:fileapi.CreateFileW",
      "title": "CreateFileW function (fileapi.h)",
      "api_type": [
        "DllExport"
      ],
      "api_location": [
        "api-ms-win-core-file-l1-2-5.dll",
        "api-ms-win-core-file-l1-2-4.dll",
        "api-ms-win-core-file-l1-2-3.dll",
        "Kernel32.dll",
        "API-MS-Win-Core-File-l1-1-0.dll",
        "KernelBase.dll",
        "API-MS-Win-Core-File-l1-2-0.dll",
        "API-MS-Win-Core-File-l1-2-1.dll",
        "API-MS-Win-Core-File-l1-2-2.dll",
        "API-MS-Win-DownLevel-Kernel32-l1-1-0.dll",
        "MinKernelBase.dll"
      ],
      "api_name": [
        "CreateFile",
        "CreateFileA",
        "CreateFileW"
      ]
    },
    "params": {
      "lpFileName": {
        "directions": [
          "in"
        ],
        "values": {}
      },
      "dwDesiredAccess": {
        "directions": [
          "in"
        ],
        "values": {}
      },
      "dwShareMode": {
        "directions": [
          "in"
        ],
        "values": {
          "FILE_SHARE_DELETE": 4,
          "FILE_SHARE_READ": 1,
          "FILE_SHARE_WRITE": 2
        }
      },
      "lpSecurityAttributes": {
        "directions": [
          "in",
          "optional"
        ],
        "values": {}
      },
      "dwCreationDisposition": {
        "directions": [
          "in"
        ],
        "values": {
          "CREATE_ALWAYS": 2,
          "CREATE_NEW": 1,
          "OPEN_ALWAYS": 4,
          "OPEN_EXISTING": 3,
          "TRUNCATE_EXISTING": 5
        }
      },
      "dwFlagsAndAttributes": {
        "directions": [
          "in"
        ],
        "values": {
          "FILE_ATTRIBUTE_ARCHIVE": 32,
          "FILE_ATTRIBUTE_ENCRYPTED": 16384,
          "FILE_ATTRIBUTE_HIDDEN": 2,
          "FILE_ATTRIBUTE_NORMAL": 128,
          "FILE_ATTRIBUTE_OFFLINE": 4096,
          "FILE_ATTRIBUTE_READONLY": 1,
          "FILE_ATTRIBUTE_SYSTEM": 4,
          "FILE_ATTRIBUTE_TEMPORARY": 256,
          "FILE_FLAG_BACKUP_SEMANTICS": 33554432,
          "FILE_FLAG_DELETE_ON_CLOSE": 67108864,
          "FILE_FLAG_NO_BUFFERING": 536870912,
          "FILE_FLAG_OPEN_NO_RECALL": 1048576,
          "FILE_FLAG_OPEN_REPARSE_POINT": 2097152,
          "FILE_FLAG_OVERLAPPED": 1073741824,
          "FILE_FLAG_POSIX_SEMANTICS": 16777216,
          "FILE_FLAG_RANDOM_ACCESS": 268435456,
          "FILE_FLAG_SESSION_AWARE": 8388608,
          "FILE_FLAG_SEQUENTIAL_SCAN": 134217728,
          "FILE_FLAG_WRITE_THROUGH": 2147483648,
          "SECURITY_ANONYMOUS": null,
          "SECURITY_CONTEXT_TRACKING": null,
          "SECURITY_DELEGATION": null,
          "SECURITY_EFFECTIVE_ONLY": null,
          "SECURITY_IDENTIFICATION": null,
          "SECURITY_IMPERSONATION": null
        }
      },
      "hTemplateFile": {
        "directions": [
          "in",
          "optional"
        ],
        "values": {}
      }
    }
  }
}

usage: sparse.py [-h] [-o OUTPUT] [--chunk-size CHUNK_SIZE] [--workers WORKERS] [--silent] input_dir

Parse SDK-API Native Function .md files to JSON

positional arguments:
  input_dir             Root content directory to search for nf-*.md files

options:
  -h, --help            show this help message and exit
  -o, --output OUTPUT   Output JSON file path
  --chunk-size CHUNK_SIZE
                        Files per worker chunk
  --workers WORKERS     Number of parallel workers, or "max" to use _MAX_WINDOWS_WORKERS
  --silent              Suppress all output

Example invocation:
python.exe sparse.py -o output.json .\sdk-api\sdk-api-src\content\
Parsing 43,337 entries takes ~8 seconds on an Intel(R) Core(TM) i9-14900HX (16c/32t), using 32 workers and a chunk size of 64 (default settings).
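The worker and chunk-size settings can be pictured with a small sketch. This is not the actual sparse.py code: `chunked`, `parse_chunk`, and `parse_all` are hypothetical names, and `parse_chunk` is a stand-in that only records file lengths instead of parsing anything.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import islice
from pathlib import Path


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def parse_chunk(paths):
    """Hypothetical worker: a real parser would return parsed entries.

    Here each path is simply mapped to the length of its text.
    """
    return {str(p): len(Path(p).read_text(encoding="utf-8")) for p in paths}


def parse_all(paths, chunk_size=64, workers=4):
    """Fan chunks out to a process pool and merge the per-chunk results."""
    merged = {}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(parse_chunk, chunked(paths, chunk_size)):
            merged.update(partial)
    return merged
```

Handing each worker a chunk rather than a single file keeps inter-process overhead low when there are tens of thousands of small files. On Windows, `parse_all` should be called from under an `if __name__ == "__main__":` guard, and `ProcessPoolExecutor` accepts at most 61 workers there, which is presumably (an assumption) what `_MAX_WINDOWS_WORKERS` refers to.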
Within sdk-api, the relevant Markdown files live in the sdk-api-src/content folder, which is split into further subfolders, sometimes by the respective WinSDK header, sometimes by other criteria.
We glob all the files in that path whose names start with nf- (Native Function) and end in .md, and then we parse them using the parser Python module we implement.
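The discovery step might look like the following minimal sketch. The function name `find_native_function_docs` is hypothetical and not taken from sparse.py; only the `nf-*.md` pattern comes from the tool's help text.

```python
from pathlib import Path


def find_native_function_docs(content_root):
    """Return every nf-*.md file below `content_root`, sorted for stable output."""
    root = Path(content_root)
    # rglob walks all subfolders, so the per-header folder layout does not matter.
    return sorted(p for p in root.rglob("nf-*.md") if p.is_file())
```

Calling it with the content folder from the example invocation would return the list of files to hand to the parser.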
The files have a general structure:
- they start with a YAML front matter block (which contains a lot of important metadata)
- they contain a function/method header
  - this is where we extract the root JSON object names from
- they may contain -param subheaders; these usually contain an HTML table when a parameter has a set of defined common values exposed on its documentation page.
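The first two points of that structure can be sketched as follows. `SAMPLE` is a simplified, hypothetical excerpt rather than a verbatim sdk-api file, all function names are made up for illustration, and the front matter parser only handles flat `key: value` lines; list-valued keys such as `api_location` would need a real YAML library.

```python
import re

# Simplified, hypothetical excerpt of an nf-*.md file.
SAMPLE = """\
---
UID: NF:fileapi.CreateFileW
title: CreateFileW function (fileapi.h)
---

# CreateFileW function

## -description
Creates or opens a file or I/O device.
"""


def split_front_matter(text):
    """Split a document into (front_matter, body) at the '---' delimiters."""
    match = re.match(r"\A---\r?\n(.*?)\r?\n---\r?\n(.*)\Z", text, re.DOTALL)
    if match is None:
        return "", text
    return match.group(1), match.group(2)


def parse_simple_front_matter(front_matter):
    """Parse flat 'key: value' lines only; nested entries are skipped."""
    result = {}
    for line in front_matter.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() and not line.startswith((" ", "-")):
            result[key.strip()] = value.strip()
    return result


def function_name(body):
    """Take the first word of the first level-1 heading as the function name."""
    match = re.search(r"^# +(\S+)", body, re.MULTILINE)
    return match.group(1) if match else None


front, body = split_front_matter(SAMPLE)
meta = parse_simple_front_matter(front)
print(function_name(body), meta["UID"])
# prints: CreateFileW NF:fileapi.CreateFileW
```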
Our parser works with the aforementioned facts to extract information.
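As a rough illustration of the parameter step, the sketch below pulls directions and constant values out of -param sections. The excerpt is a simplified, hypothetical rendering of the documentation tables, the regular expressions and function names are assumptions rather than what sparse.py does, and real pages carry more markup than shown here.

```python
import json
import re

# Simplified, hypothetical excerpt of the parameters part of an nf-*.md file.
SAMPLE = """\
### -param lpFileName [in]

The name of the file or device to be created or opened.

### -param dwFlagsAndAttributes [in]

<table>
<tr><th>Value</th><th>Meaning</th></tr>
<tr>
<td><dl>
<dt><b>FILE_ATTRIBUTE_HIDDEN</b></dt>
<dt>2 (0x2)</dt>
</dl></td>
<td>The file is hidden.</td>
</tr>
<tr>
<td><dl>
<dt><b>SECURITY_ANONYMOUS</b></dt>
</dl></td>
<td>A constant listed without a numeric value.</td>
</tr>
</table>

### -param hTemplateFile [in, optional]

A valid handle to a template file.
"""

PARAM_RE = re.compile(r"^### -param (\S+)(?: \[([^\]]*)\])?[ \t]*$", re.MULTILINE)
VALUE_RE = re.compile(r"<dt><b>(\w+)</b></dt>\s*(?:<dt>([^<]*)</dt>)?")


def to_int(text):
    """Convert '0x00000001' or '2 (0x2)' to an int; return None when unparsable."""
    tokens = (text or "").split()
    if not tokens:
        return None
    try:
        return int(tokens[0], 0)
    except ValueError:
        return None


def parse_params(body):
    """Map each parameter name to its directions and documented constants."""
    params = {}
    matches = list(PARAM_RE.finditer(body))
    for i, m in enumerate(matches):
        # A section runs from this -param header to the next one (or the end).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(body)
        section = body[m.end():end]
        directions = [d.strip() for d in (m.group(2) or "").split(",") if d.strip()]
        values = {name: to_int(raw) for name, raw in VALUE_RE.findall(section)}
        params[m.group(1)] = {"directions": directions, "values": values}
    return params


print(json.dumps(parse_params(SAMPLE), indent=2))
```

In this sketch a constant that is listed without a numeric value ends up as `null` in the JSON, which is consistent with the `SECURITY_*` entries in the sample output; whether sparse.py arrives at `null` the same way is an assumption.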
A long, long time ago, the person behind Hexacorn published "Hunting for Windows API prototypes and descriptions…" (posted on 2023-10-25), an article in which they discuss existing projects that parse WinAPI metadata.
The projects discussed include:
- msdocsviewer by Alexander Hanel
- Mandiant's IDA MSDN annotations
- etc.
They also discuss how Microsoft used to distribute the documentation that is available on MSDN today (HLP files, more recently CHM files, etc.).
Those are particularly helpful for avoiding web requests to Microsoft's MSDN, and the (even more) unreliable process of parsing the HTML pages generated from the Markdown they are based on.
Today, however, the gold standard is sdk-api (and its kernel/virt/etc. friends), as was first and most aptly demonstrated in our community by msdocsviewer, and also by Microsoft in win32metadata (the link points directly to one of their sdk-api parsers).
Unfortunately, none of the projects above seems to generate lengthy, helpful metadata covering the details of each and every method, especially not in a parsable form.
This is why, I believe, the person behind Hexacorn went on to write their own parser (which, in its structure, is very similar to sparse).
As a matter of fact, I'm fairly sure that they rely on the exact same assumptions to extract data as we do, since we seem to share the same parsing limitations for common constant values.
So, in some ways, this is just an open-source way to reproduce the same outputs without needing to write your own parser for their data (like I have in the past!)
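As an illustration of consuming that output, here is a minimal sketch that decodes a raw bitmask using the `values` map of a parameter. `decode_flags` is a hypothetical helper and not part of sparse; the excerpt mirrors the shape of the CreateFileW sample and would normally come from `json.load()` on the generated file.

```python
import json

# Excerpt shaped like the generated output for CreateFileW.
EXCERPT = json.loads("""
{
  "CreateFileW": {
    "params": {
      "dwShareMode": {
        "directions": ["in"],
        "values": {"FILE_SHARE_DELETE": 4, "FILE_SHARE_READ": 1, "FILE_SHARE_WRITE": 2}
      }
    }
  }
}
""")


def decode_flags(api, function, param, mask):
    """Return the names of documented constants whose bits are all set in `mask`."""
    values = api[function]["params"][param]["values"]
    return sorted(
        name
        for name, value in values.items()
        # Constants without a known value (null in the JSON) are skipped.
        if isinstance(value, int) and value != 0 and (mask & value) == value
    )


print(decode_flags(EXCERPT, "CreateFileW", "dwShareMode", 0x3))
# prints ['FILE_SHARE_READ', 'FILE_SHARE_WRITE']
```

This only makes sense for parameters whose constants are bit flags; an enumeration such as dwCreationDisposition should be matched by equality instead.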


