| description | Are you an LLM? Start here. |
|---|---|
OpenAlex is a fully open catalog of scholarly works, authors, sources, institutions, topics, publishers, and funders.
Base URL: https://api.openalex.org
Documentation: https://docs.openalex.org
No authentication required | 100,000 requests/day limit
WRONG: ?page=5, ?page=17, ?page=42 to get "random" results. This is NOT random sampling and will bias your results!
CORRECT: https://api.openalex.org/works?sample=20
For consistent results, add a seed: ?sample=20&seed=123
The sample parameter maxes out at reasonable sizes for a single request.
For large random samples (10k+ records):
- Make multiple sample requests with different seeds
- Combine results
- Deduplicate by ID
Example:
- ?sample=1000&seed=1
- ?sample=1000&seed=2
- ?sample=1000&seed=3
Then deduplicate the combined results by checking work IDs.
See: https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/sample-entity-lists
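The multi-seed approach above can be sketched as follows. This is a minimal illustration, not an official client: `sample_urls` and `dedupe_by_id` are hypothetical helper names, fetching is left to the caller, and the records in the demonstration are made up.

```python
BASE = "https://api.openalex.org"

def sample_urls(endpoint, size, seeds):
    """Build one sample URL per seed, mirroring the examples above."""
    return [f"{BASE}/{endpoint}?sample={size}&seed={seed}" for seed in seeds]

def dedupe_by_id(batches):
    """Merge several lists of entity dicts, keeping the first copy of each ID."""
    seen = {}
    for batch in batches:
        for entity in batch:
            seen.setdefault(entity["id"], entity)
    return list(seen.values())

# Offline demonstration: made-up records stand in for two API responses.
batch_a = [{"id": "https://openalex.org/W1"}, {"id": "https://openalex.org/W2"}]
batch_b = [{"id": "https://openalex.org/W2"}, {"id": "https://openalex.org/W3"}]
unique = dedupe_by_id([batch_a, batch_b])
print(len(unique))  # 3
```

Because different seeds can return overlapping records, the deduplicated total may be smaller than the sum of the batch sizes.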
WRONG: /works?filter=author_name:Einstein
Entity names are ambiguous and this won't work!
CORRECT two-step process:
- Find the entity ID: /authors?search=einstein
  The response shows an ID like "A5023888391" or the full URI.
- Use ID to filter: /works?filter=authorships.author.id:A5023888391
Why? Names are ambiguous. "MIT" could be many institutions. IDs are unique. This applies to: authors, institutions, sources, topics, publishers, funders.
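The two-step lookup can be sketched like this. It is an illustration under assumptions: `works_url_for_author` and `short_id` are hypothetical helpers, `get_json` is any callable you supply that fetches a URL and returns parsed JSON, and taking the first search hit is a simplification.

```python
from urllib.parse import quote_plus

BASE = "https://api.openalex.org"

def short_id(openalex_uri):
    """Turn 'https://openalex.org/A5023888391' into 'A5023888391'."""
    return openalex_uri.rsplit("/", 1)[-1]

def works_url_for_author(name, get_json):
    """Step 1: search authors by name. Step 2: build a works filter from the ID."""
    found = get_json(f"{BASE}/authors?search={quote_plus(name)}")
    if not found["results"]:
        return None  # no match; the caller decides what to do
    # Simplification: the first hit is used. Inspect the candidates in real use.
    author_id = short_id(found["results"][0]["id"])
    return f"{BASE}/works?filter=authorships.author.id:{author_id}"

def fake_get_json(url):
    """Offline stand-in for the API, returning the ID shown in the text above."""
    return {"results": [{"id": "https://openalex.org/A5023888391"}]}

print(works_url_for_author("Heather Piwowar", fake_get_json))
```

Injecting `get_json` keeps the sketch testable offline; in practice it would wrap an HTTP client with retries and a timeout.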
WRONG: Attempting SQL-style "GROUP BY topic, year" in a single API call. The API does not support this.
To analyze by topic AND year (or any two dimensions):
- Make one query per year: ?filter=publication_year:2020&group_by=topics.id
- Repeat for 2021, 2022, etc.
- Combine results in your code
The API only supports one group_by per request.
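A two-dimensional breakdown built from one group_by query per year might look like the sketch below. `counts_by_year_and_topic` is a hypothetical helper, `get_json` is a caller-supplied fetch function, and the counts in the stand-in response are invented.

```python
BASE = "https://api.openalex.org"

def counts_by_year_and_topic(years, get_json):
    """Return {(year, topic_key): count} using one group_by request per year."""
    table = {}
    for year in years:
        url = f"{BASE}/works?filter=publication_year:{year}&group_by=topics.id"
        for bucket in get_json(url)["group_by"]:
            table[(year, bucket["key"])] = bucket["count"]
    return table

def fake_get_json(url):
    """Offline stand-in: one topic bucket per year with a made-up count."""
    count = 20 if "publication_year:2020" in url else 21
    return {"group_by": [{"key": "https://openalex.org/T10001", "count": count}]}

table = counts_by_year_and_topic([2020, 2021], fake_get_json)
print(table)
```

Keying the result by a (year, topic) tuple makes it easy to pivot later into a table or data frame.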
API errors are common, especially at scale. Immediate retries can make things worse.
When you get errors (429 rate limit, 500 server error, timeouts):
- Catch the error
- Wait before retrying (1s, 2s, 4s, 8s, etc.)
- Include a max retry limit (e.g., 5 attempts)
- Log failures for debugging
Default is only 25 results per page. Slow for large extracts!
FAST: ?per-page=200
This reduces the number of API calls needed by 8x compared to the default.
SLOW: Loop through 100 DOIs making 100 separate API calls.
FAST: Combine up to 50 IDs in one query using the pipe separator:
/works?filter=doi:https://doi.org/10.1371/journal.pone.0266781|https://doi.org/10.1371/journal.pone.0267149|...
You can include up to 50 values per filter. Use per-page=50 to get all results.
See: https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/filter-entity-lists#addition-or
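Splitting a long DOI list into pipe-separated batches could be done along these lines. The helper names are hypothetical and the DOIs in the demonstration are placeholders, not real records.

```python
BASE = "https://api.openalex.org"
MAX_VALUES = 50  # per-filter limit stated above

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def doi_batch_urls(dois):
    """Build one /works URL per batch of up to 50 DOIs."""
    return [
        f"{BASE}/works?filter=doi:{'|'.join(chunk)}&per-page={MAX_VALUES}"
        for chunk in chunked(dois, MAX_VALUES)
    ]

# Placeholder DOIs for illustration only.
dois = [f"https://doi.org/10.0000/example.{i}" for i in range(120)]
urls = doi_batch_urls(dois)
print(len(urls))  # 3
```

With 120 DOIs this yields three requests instead of 120.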
Using multiple threads WITHOUT respecting rate limits will get you rate-limited or banned.
- Default pool: 1 request/second
- Polite pool (with email): 10 requests/second
- Daily limit: 100,000 requests
When using threading/async:
- Implement rate limiting across ALL threads
- Track requests per second globally
- Add your email to requests for 10x higher limits
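One way to enforce a global limit across threads is a small slot-reserving limiter, sketched below. `RateLimiter` here is an illustrative class written for this guide, not part of any OpenAlex library; the `clock` and `sleep` arguments are assumptions added so the behaviour can be checked without real waiting.

```python
import threading
import time

class RateLimiter:
    """Space out calls so all threads together stay under `per_second`."""

    def __init__(self, per_second, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / per_second
        self.clock = clock
        self.sleep = sleep
        self.lock = threading.Lock()
        self.next_slot = 0.0  # earliest time the next request may start

    def wait(self):
        """Block until this caller's reserved slot arrives; return the delay."""
        # Reserve a slot while holding the lock, then sleep outside it so
        # other threads can reserve their own slots in the meantime.
        with self.lock:
            now = self.clock()
            slot = max(self.next_slot, now)
            self.next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            self.sleep(delay)
        return delay

limiter = RateLimiter(10)  # polite pool: 10 requests/second
limiter.wait()             # call this before every request, in every thread
```

Sharing a single `limiter` instance among all workers is what makes the limit global rather than per thread.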
Base: https://api.openalex.org
Auth: None required
Rate: 100k requests/day
Add your email to ANY request using the mailto parameter:
https://api.openalex.org/works?mailto=yourname@example.edu
This increases your rate limit from 1 req/sec → 10 req/sec.
Always do this for production applications!
/works - 240M+ scholarly documents (articles, books, datasets)
/authors - Researcher profiles with disambiguated identities
/sources - Journals, repositories, conferences
/institutions - Universities, research organizations
/topics - Subject classifications (3-level hierarchy)
/publishers - Publishing organizations
/funders - Funding agencies
/text - Tag your own text with OpenAlex topics/keywords (POST)
filter= - Filter results (see filter syntax below)
search= - Full-text search across title/abstract/fulltext
sort= - Sort results (e.g., cited_by_count:desc)
per-page= - Results per page (default: 25, max: 200)
page= - Page number for pagination
sample= - Get random results (e.g., sample=50)
seed= - Seed for reproducible sampling
select= - Limit returned fields (e.g., select=id,title)
group_by= - Aggregate results by a field
mailto= - Your email (gets you into polite pool)
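The parameters above can be assembled with the standard library, as in this sketch. `build_url` is a hypothetical helper; keeping `:`, `,`, `|`, `<`, `>` and `!` unescaped is a readability choice made here, not an API requirement.

```python
from urllib.parse import urlencode

BASE = "https://api.openalex.org"

def build_url(endpoint, params, mailto=None):
    """Build a request URL from a dict of query parameters."""
    query = dict(params)
    if mailto:
        query["mailto"] = mailto  # polite pool
    # Keep filter punctuation readable instead of percent-encoding it.
    return f"{BASE}/{endpoint}?{urlencode(query, safe=':,|<>!')}"

url = build_url(
    "works",
    {"filter": "publication_year:>2020", "sort": "cited_by_count:desc", "per-page": 200},
    mailto="yourname@example.edu",
)
print(url)
```

A dict is used rather than keyword arguments because `per-page` is not a valid Python identifier.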
Single filter: ?filter=publication_year:2020
Multiple (AND): ?filter=publication_year:2020,is_oa:true
Values (OR): ?filter=type:article|book
Negation: ?filter=type:!article
Greater than: ?filter=cited_by_count:>100
Less than: ?filter=publication_year:<2020
Range: ?filter=publication_year:2020-2023
You can express AND within a single attribute two ways:
Repeat filter: ?filter=institutions.country_code:us,institutions.country_code:gb
Use + symbol: ?filter=institutions.country_code:us+gb
Both mean: "works with author from US AND author from GB"
Any of these: ?filter=institutions.country_code:us|gb|ca
Batch IDs: ?filter=doi:10.1/abc|10.2/def|10.3/ghi
You can combine up to 50 values with pipes.
POSSIBLE: ?filter=type:article|book (article OR book)
POSSIBLE: ?filter=publication_year:2020|2021,type:article ((2020 OR 2021) AND article)
NOT POSSIBLE: OR across different attributes, e.g. "year=2020 OR type=article"
WORKAROUND: Make separate queries and combine results
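The filter syntax above maps naturally onto a dictionary, as in the following sketch. `build_filter` is a hypothetical helper: each key is an attribute, a list value becomes a pipe-separated OR, and separate keys are joined with commas (AND).

```python
MAX_VALUES = 50  # limit on pipe-separated values stated above

def format_value(value):
    """Render booleans as true/false; everything else via str()."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)

def build_filter(conditions):
    """Turn {attribute: value or [values]} into a filter string."""
    parts = []
    for attribute, value in conditions.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        if len(values) > MAX_VALUES:
            raise ValueError(f"{attribute}: more than {MAX_VALUES} values")
        joined = "|".join(format_value(v) for v in values)
        parts.append(f"{attribute}:{joined}")
    return ",".join(parts)

print(build_filter({"publication_year": 2020, "is_oa": True, "type": ["article", "book"]}))
```

Operators such as `>100` or `!article` can be passed as plain strings, since the helper does not interpret values.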
Small sample:
https://api.openalex.org/works?sample=20
Reproducible sample:
https://api.openalex.org/works?sample=20&seed=42
Large sample (10k+):
1. https://api.openalex.org/works?sample=1000&seed=1
2. https://api.openalex.org/works?sample=1000&seed=2
3. https://api.openalex.org/works?sample=1000&seed=3
...then deduplicate by ID
Simple search:
https://api.openalex.org/works?search=machine+learning
Search specific field:
https://api.openalex.org/works?filter=title.search:CRISPR
Search + filter:
https://api.openalex.org/works?search=climate&filter=publication_year:2023
Step 1 - Get author ID:
https://api.openalex.org/authors?search=Heather+Piwowar
Response includes: "id": "https://openalex.org/A5023888391"
Step 2 - Get their works:
https://api.openalex.org/works?filter=authorships.author.id:A5023888391
Alternative - Use ORCID directly:
https://api.openalex.org/works?filter=authorships.author.id:https://orcid.org/0000-0003-1613-5981
Step 1 - Get institution ID:
https://api.openalex.org/institutions?search=MIT
Response includes: "id": "https://openalex.org/I136199984"
Step 2 - Get their works:
https://api.openalex.org/works?filter=authorships.institutions.id:I136199984
Alternative - Use ROR directly:
https://api.openalex.org/works?filter=authorships.institutions.id:https://ror.org/042nb2s44
Most-cited works published after 2020:
https://api.openalex.org/works?filter=publication_year:>2020&sort=cited_by_count:desc&per-page=200
All OA:
https://api.openalex.org/works?filter=is_oa:true
Gold OA only:
https://api.openalex.org/works?filter=open_access.oa_status:gold
Published OA version:
https://api.openalex.org/works?filter=has_oa_published_version:true
Recent OA works about COVID from top institutions:
https://api.openalex.org/works?filter=publication_year:2022,is_oa:true,title.search:covid,authorships.institutions.id:I136199984|I27837315
Breaking down the filters:
- publication_year:2022 (recent)
- is_oa:true (open access)
- title.search:covid (about COVID)
- authorships.institutions.id:I136199984|I27837315 (MIT or Harvard)
Get specific works by DOI (efficient batch method):
https://api.openalex.org/works?filter=doi:https://doi.org/10.1371/journal.pone.0266781|https://doi.org/10.1371/journal.pone.0267149|https://doi.org/10.1371/journal.pone.0267890&per-page=50
Up to 50 DOIs per request. Use per-page=50 to ensure you get all results.
Step 1 - Get source ID:
https://api.openalex.org/sources?search=Nature
Response includes: "id": "https://openalex.org/S137773608"
Step 2 - Get works from that source:
https://api.openalex.org/works?filter=primary_location.source.id:S137773608
Top topics by work count:
https://api.openalex.org/works?group_by=topics.id
Papers per year:
https://api.openalex.org/works?group_by=publication_year
Most prolific institutions:
https://api.openalex.org/works?group_by=authorships.institutions.id
Group with filters:
https://api.openalex.org/works?filter=publication_year:>2020&group_by=topics.id
First page:
https://api.openalex.org/works?filter=publication_year:2023&per-page=200
Next pages:
https://api.openalex.org/works?filter=publication_year:2023&per-page=200&page=2
https://api.openalex.org/works?filter=publication_year:2023&per-page=200&page=3
...
The meta.count field tells you total results.
Calculate pages needed: ceil(meta.count / per-page)
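The page calculation can be written out as below. `pages_needed` and `page_urls` are hypothetical helpers, and the total count would come from `meta.count` in the first response.

```python
import math

BASE = "https://api.openalex.org"

def pages_needed(total_count, per_page=200):
    """ceil(meta.count / per-page), as described above."""
    return math.ceil(total_count / per_page)

def page_urls(filter_string, total_count, per_page=200):
    """List the URL of every page for a filtered /works query."""
    return [
        f"{BASE}/works?filter={filter_string}&per-page={per_page}&page={page}"
        for page in range(1, pages_needed(total_count, per_page) + 1)
    ]

print(pages_needed(1001))  # 6
```

For example, 1,001 results at 200 per page need six requests, because the last page holds a single record.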
Just IDs and titles:
https://api.openalex.org/works?select=id,title&per-page=200
Multiple fields:
https://api.openalex.org/works?select=id,title,publication_year,cited_by_count
Fast autocomplete endpoint for building search UIs:
Authors:
https://api.openalex.org/autocomplete/authors?q=einst
Institutions:
https://api.openalex.org/autocomplete/institutions?q=stanford
Works:
https://api.openalex.org/autocomplete/works?q=neural+networks
Typically returns in ~200ms
POST or GET to classify your own content:
https://api.openalex.org/text?title=Machine+learning+for+drug+discovery
Returns topics, keywords, and concepts for your text.
Limited to 1 req/sec and 1000 req/day.
Text must be 20-2000 characters.
All list endpoints (/works, /authors, etc.) return:
{
  "meta": {
    "count": 240523418,
    "db_response_time_ms": 42,
    "page": 1,
    "per_page": 25
  },
  "results": [
    { /* entity object */ },
    { /* entity object */ },
    ...
  ]
}

Getting a single entity returns the object directly:
https://api.openalex.org/works/W2741809807
→ Returns a Work object directly (no meta/results wrapper)
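Client code usually needs to tell the two response shapes apart. The sketch below shows one way to do so; the helper names are hypothetical and the payload is a trimmed, hand-written stand-in for a real response.

```python
import json

def is_list_response(payload):
    """List endpoints wrap entities in meta/results; single entities do not."""
    return "meta" in payload and "results" in payload

def unpack_list_response(payload):
    """Return (total_count, results) from a list-endpoint response."""
    return payload["meta"]["count"], payload["results"]

# Hand-written stand-in for a list response.
sample = json.loads("""
{
  "meta": {"count": 240523418, "db_response_time_ms": 42, "page": 1, "per_page": 25},
  "results": [{"id": "https://openalex.org/W2741809807"}]
}
""")

total, results = unpack_list_response(sample)
print(total, len(results))
```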
Responses to group_by queries include a group_by array:
{
  "meta": { "count": 100, ... },
  "group_by": [
    {
      "key": "https://openalex.org/T10001",
      "key_display_name": "Artificial Intelligence",
      "count": 15234
    },
    ...
  ]
}

SLOW: Default 25 per page = more API calls
FAST: ?per-page=200 (8x fewer API calls)
SLOW: Loop through 50 DOIs, 50 API calls
FAST: One call with pipe-separated DOIs (up to 50)
SLOW: Full objects with all fields
FAST: ?select=id,title,publication_year
# Pseudo-code: RateLimiter stands for any thread-safe limiter with a wait() method
from concurrent.futures import ThreadPoolExecutor
import requests

rate_limiter = RateLimiter(10)  # 10 req/sec with polite pool

def fetch_page(page_num):
    rate_limiter.wait()  # Ensure we don't exceed rate limit
    return requests.get(f"...&page={page_num}&mailto=you@example.edu", timeout=30)

with ThreadPoolExecutor(max_workers=10) as executor:
    results = executor.map(fetch_page, range(1, 101))

WITHOUT: 1 request/second max
WITH: 10 requests/second max
Just add: &mailto=yourname@example.edu
200 OK - Success
400 Bad Request - Invalid parameter (check your filter syntax)
403 Forbidden / 429 Too Many Requests - Rate limit exceeded (slow down, implement backoff)
404 Not Found - Entity doesn't exist
500 Server Error - Temporary issue (retry with backoff)
import time

import requests

def fetch_with_retry(url, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            if response.status_code == 200:
                return response.json()
            elif response.status_code in (403, 429):
                # Rate limited: back off exponentially
                wait_time = 2 ** attempt  # 1s, 2s, 4s, 8s, 16s
                time.sleep(wait_time)
            elif response.status_code >= 500:
                # Server error: same backoff schedule
                wait_time = 2 ** attempt
                time.sleep(wait_time)
            else:
                # Other error, don't retry
                response.raise_for_status()
        except requests.exceptions.Timeout:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)
            else:
                raise
    raise Exception(f"Failed after {max_retries} retries")

Works:
authorships.author.id - Author's OpenAlex ID
authorships.institutions.id - Institution's OpenAlex ID
cited_by_count - Number of citations
is_oa - Is open access (true/false)
publication_year - Year published
primary_location.source.id - Source (journal) ID
topics.id - Topic ID
type - article, book, dataset, etc.
has_doi - Has a DOI (true/false)
has_fulltext - Has fulltext available (true/false)
Authors:
last_known_institution.id - Current/last institution
works_count - Number of works authored
cited_by_count - Total citations
orcid - ORCID identifier
Sources:
host_organization - Publisher/host
type - journal, repository, etc.
is_oa - Is open access
Institutions:
type - education, healthcare, company, etc.
country_code - Two-letter country code
continent - africa, asia, europe, etc.
You can use external IDs directly in the API:
DOI: /works/https://doi.org/10.7717/peerj.4375
PMID: /works/pmid:29844763
ORCID: /authors/https://orcid.org/0000-0003-1613-5981
ROR: /institutions/https://ror.org/02y3ad647
ISSN: /sources/issn:0028-0836
Always use a seed for reproducible sampling:
https://api.openalex.org/works?sample=100&seed=42
Same seed = same results every time.
Get cited works:
1. Get work: /works/W2741809807
2. Response includes: "referenced_works": [...]
3. Fetch those: /works?filter=openalex_id:W123|W456|W789
Get citing works:
1. Get work: /works/W2741809807
2. Response includes: "cited_by_api_url"
3. Follow that URL
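Both directions of the citation network can be followed with a couple of small helpers, sketched here. The function names are hypothetical, and the work object in the demonstration is a made-up stand-in that contains only the two fields mentioned above.

```python
BASE = "https://api.openalex.org"

def referenced_work_urls(work, batch_size=50):
    """Build batched /works URLs for everything a work cites."""
    ids = [uri.rsplit("/", 1)[-1] for uri in work.get("referenced_works", [])]
    urls = []
    for start in range(0, len(ids), batch_size):
        batch = "|".join(ids[start:start + batch_size])
        urls.append(f"{BASE}/works?filter=openalex_id:{batch}&per-page={batch_size}")
    return urls

def citing_works_url(work):
    """The API already supplies the URL that lists citing works."""
    return work["cited_by_api_url"]

# Made-up stand-in for a Work object with 60 references.
work = {
    "referenced_works": [f"https://openalex.org/W{i}" for i in range(1, 61)],
    "cited_by_api_url": "https://api.openalex.org/works?filter=cites:W2741809807",
}
print(len(referenced_work_urls(work)))  # 2
```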
Exact year: ?filter=publication_year:2020
After: ?filter=publication_year:>2020
Before: ?filter=publication_year:<2020
Range: ?filter=publication_year:2018-2022
The search parameter supports boolean operators:
AND: ?search=climate+AND+change
OR: ?search=climate+OR+weather
NOT: ?search=climate+NOT+politics
Default pool (no email):
- 1 request per second
- 100,000 requests per day
- Sequential processing recommended
Polite pool (with mailto):
- 10 requests per second
- 100,000 requests per day
- Parallel processing viable
- Always include your email for production use
1. Track requests per second globally (not per thread)
2. Use a semaphore or rate limiter across threads
3. Add delays between batches if needed
4. Monitor for 403 or 429 responses (rate limit exceeded)
5. Back off if you hit limits
With 100k/day limit:
- ~4,166 requests per hour average
- ~69 requests per minute average
- Plan accordingly for large jobs
- Consider OpenAlex Premium for higher limits
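A quick budget estimate for a large job follows from the limits above. `plan_extract` is a hypothetical helper; it assumes every request returns a full page and ignores retries.

```python
import math

DAILY_LIMIT = 100_000  # requests per day, as stated above

def plan_extract(total_records, per_page=200, requests_per_second=10):
    """Estimate request count, minimum runtime, and days under the daily cap."""
    requests_needed = math.ceil(total_records / per_page)
    return {
        "requests": requests_needed,
        "min_seconds": requests_needed / requests_per_second,
        "days_at_daily_limit": math.ceil(requests_needed / DAILY_LIMIT),
    }

print(plan_extract(1_000_000))
```

One million records at 200 per page is 5,000 requests, which fits within a single day's quota and takes at least 500 seconds at 10 requests per second.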
- ❌ Using page numbers for sampling → ✅ Use ?sample=
- ❌ Filtering by entity names → ✅ Get IDs first, then filter
- ❌ Default page size → ✅ Use per-page=200
- ❌ Sequential ID lookups → ✅ Batch with pipe (|) operator
- ❌ No error handling → ✅ Implement retry with backoff
- ❌ Ignoring rate limits in threads → ✅ Global rate limiting
- ❌ Trying to group by multiple fields → ✅ Multiple queries + combine
- ❌ Not including email → ✅ Add mailto= for 10x speed
- ❌ Fetching all fields → ✅ Use select= for needed fields only
- ❌ Assuming instant responses → ✅ Add timeouts (30s recommended)
- Full documentation: https://docs.openalex.org
- API Overview: https://docs.openalex.org/how-to-use-the-api/api-overview
- Entity schemas: https://docs.openalex.org/api-entities
- Help: https://openalex.org/help
- User group: https://groups.google.com/g/openalex-users
If you need:
- More than 100k requests/day
- Faster than daily snapshot updates
- Commercial support
- SLA guarantees
See: https://openalex.org/pricing
Last updated: 2025-10-13
Maintained for: LLM agents, AI applications, and automated tools