56 changes: 40 additions & 16 deletions docs/userGuide/makingTheSiteSearchable.md
@@ -101,34 +101,58 @@ In your `site.json`:

This tells Pagefind to exclude any element with the `algolia-no-index` class (or containing it in a space-separated list) from the search index, similar to using `data-pagefind-ignore`.

For more details, see the [Pagefind documentation on the exclude selectors configuration option](https://pagefind.app/docs/config-options/#exclude-selectors).

#### Limiting Which Pages Are Searchable

You can use the `glob` option to limit which pages are indexed by Pagefind. This is useful when you want search results to only show pages from specific sections of your site.
Pagefind uses the existing `searchable` property in your `pages` configuration to determine which pages should be indexed. This provides a seamless way to control search indexing without additional configuration.

In your `site.json`:

```json
{
"pagefind": {
"glob": [
"devGuide",
"userGuide/*"
]
}
"pages": [
{
"src": "index.md"
},
{
"src": "internal/notes.md",
"searchable": "no"
},
{
"glob": "devGuide/**/*.md",
"searchable": "no"
}
]
}
```

MarkBind supports glob patterns and will automatically append `.html` to your patterns if not specified. For example:
- `"devGuide"` becomes `"devGuide/**/*.html"`
- `"devGuide/*"` becomes `"devGuide/*.html"`
- `"**/devGuide/**"` becomes `"**/devGuide/**/*.html"`
- `"*.html"` remains `"*.html"` (no change needed)
- Pages with `searchable: "no"` (or `false`) will not appear in search results
- By default, all pages are searchable (`searchable: "yes"`)

For more details on the `searchable` property, see [site.json file documentation](siteJsonFile.html#pages).
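
As a rough illustration of the rule above, the sketch below treats a `searchable` value of `"no"` or `false` as excluded and anything else (including an omitted value) as searchable. The `PageEntry` type and the `isSearchable` helper are hypothetical names introduced for illustration; this is not MarkBind's actual implementation.

```typescript
// Hypothetical shape of one entry in the `pages` array of site.json.
type PageEntry = {
  src?: string;
  glob?: string;
  searchable?: string | boolean;
};

// Assumption: only "no" and false opt a page out; an omitted value means searchable.
function isSearchable(entry: PageEntry): boolean {
  return entry.searchable !== 'no' && entry.searchable !== false;
}

const pages: PageEntry[] = [
  { src: 'index.md' },
  { src: 'internal/notes.md', searchable: 'no' },
  { glob: 'devGuide/**/*.md', searchable: 'no' },
];

// Entries that would be handed to the search indexer under this assumption.
const indexed = pages.filter(isSearchable).map(entry => entry.src ?? entry.glob);
console.log(indexed);
```

With the configuration shown earlier, only `index.md` passes the filter.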

<panel header="Interaction with Pagefind Attributes">

MarkBind controls page indexing through the `searchable` property, which determines whether pages are passed to Pagefind for indexing. However, Pagefind also provides native attributes that can affect indexing:

Only pages matching these glob patterns will appear in search results. This can be particularly useful for:
- Multi-site setups where you want to search only specific sections
- Including only certain directories from search results
- [**`data-pagefind-body`**](https://pagefind.app/docs/indexing/#limiting-what-sections-of-a-page-are-indexed): Marks a specific element as the search content container. When this attribute exists on ANY page of your site, pages WITHOUT this attribute will not be indexed.
- [**`data-pagefind-ignore`**](https://pagefind.app/docs/indexing/#removing-individual-elements-from-the-index): Excludes specific elements from the search index.

For more details on glob patterns, see the [Pagefind documentation](https://pagefind.app/docs/config-options/#glob).
**How MarkBind handles this:**

1. Pages with `searchable: "no"` are NOT passed to Pagefind at all (they are never indexed).
2. Pages with `searchable: "yes"` (default) ARE passed to Pagefind for indexing.

**Interactions to be aware of:**

- If you add `data-pagefind-body` to a searchable page, it works as expected - the page is indexed. However, once the attribute is present on any page, only pages that carry it will be searchable.
- If you add `data-pagefind-body` to a non-searchable page, MarkBind will still NOT index it (because it's filtered before being passed to Pagefind).
- Adding `data-pagefind-ignore` to an element on a searchable page will NOT prevent the page from being indexed - the page is still added via MarkBind's indexing, but the content within that element will be ignored by Pagefind.

**Recommendation:** Use MarkBind's `searchable` property in `site.json` to control which pages are indexed, and use the `data-pagefind-ignore` attribute to exclude specific elements within a page from being searchable. Avoid using `data-pagefind-body`, as it is redundant and may lead to confusion.

</panel>
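
The interaction rules in the panel above can be modelled as a small pure function. This is a sketch of the described behaviour only, not Pagefind or MarkBind code; `PageInfo`, `hasPagefindBody`, and `pagesInSearchResults` are hypothetical names. It also assumes that a `data-pagefind-body` attribute on a non-searchable page has no effect on other pages, since that page is never passed to Pagefind.

```typescript
// Hypothetical summary of a built page for the purposes of this model.
type PageInfo = {
  path: string;
  searchable: boolean;      // MarkBind's `searchable` setting, already resolved
  hasPagefindBody: boolean; // page contains an element with `data-pagefind-body`
};

// Models which pages end up in search results:
// 1. non-searchable pages are never passed to Pagefind;
// 2. if any page that reaches Pagefind uses `data-pagefind-body`
//    (assumption: only pages Pagefind actually sees count),
//    pages without the attribute are dropped from the index.
function pagesInSearchResults(pages: PageInfo[]): string[] {
  const passedToPagefind = pages.filter(page => page.searchable);
  const bodyAttributeInUse = passedToPagefind.some(page => page.hasPagefindBody);
  return passedToPagefind
    .filter(page => !bodyAttributeInUse || page.hasPagefindBody)
    .map(page => page.path);
}

const sitePages: PageInfo[] = [
  { path: 'index.html', searchable: true, hasPagefindBody: false },
  { path: 'guide.html', searchable: true, hasPagefindBody: true },
  { path: 'internal/notes.html', searchable: false, hasPagefindBody: true },
];

console.log(pagesInSearchResults(sitePages));
```

In this model, adding `data-pagefind-body` to `guide.html` alone silently removes `index.html` from the results, which is the confusion the recommendation is trying to avoid.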

<panel header="Potential Future Enhancements">

2 changes: 1 addition & 1 deletion docs/userGuide/siteJsonFile.md
@@ -135,7 +135,7 @@ _(Optional)_ **The styling options to be applied to the site.** This includes:
* **`globExclude`**: An array of file patterns to be excluded from rendering when using `glob`, also defined in the glob syntax.
* **`title`**: The page `<title>` for the generated web page. Titles specified here take priority over titles specified in the [frontmatter](tweakingThePageStructure.html#frontmatter) of individual pages.
* **`layout`**: The [layout](tweakingThePageStructure.html#layouts) to be used by the page. Default: `default`.
* **`searchable`**: Specifies that the page(s) should be excluded from searching. Default: `yes`.
* **`searchable`**: Specifies whether the page(s) should be included in search indexing. This applies to both the built-in search and Pagefind (if enabled). Set to `"no"` or `false` to exclude the page(s) from search results. Default: `yes`.
* **`externalScripts`**: An array of external scripts to be referenced on the page. Scripts referenced will be run before the layout script.
* **`frontmatter`**: Specifies properties to add to the frontmatter of a page or glob of pages. Overrides any existing properties if they have the same name, and overrides any frontmatter properties specified in `globalOverride`.
* **`fileExtension`**: A string that specifies the output file extension (e.g., `".json"`, `".txt"`) for the generated file. If not specified, defaults to `".html"`. Note that non-HTML files do not support frontmatter or scripts.
4 changes: 3 additions & 1 deletion packages/core/src/Page/page.njk
@@ -34,7 +34,9 @@
<script>
const baseUrl = '{{ baseUrl }}'
</script>
<body {% if hasPageNavHeadings %} data-bs-spy="scroll" data-bs-target="#mb-page-nav" data-bs-offset="100" {% endif %} data-code-theme="{{ codeTheme }}">
<body
{% if hasPageNavHeadings %} data-bs-spy="scroll" data-bs-target="#mb-page-nav" data-bs-offset="100" {% endif %}
data-code-theme="{{ codeTheme }}">
{{ content }}
</body>
{{- pageUserScriptsAndStyles -}}
1 change: 0 additions & 1 deletion packages/core/src/Site/SiteConfig.ts
@@ -66,7 +66,6 @@ export class SiteConfig {

pagefind?: {
exclude_selectors?: string[];
glob?: string | string[];
};

/**
131 changes: 33 additions & 98 deletions packages/core/src/Site/SiteGenerationManager.ts
@@ -871,74 +871,6 @@ export class SiteGenerationManager {
1000,
);

/**
* Validates that a glob pattern is safe and won't traverse outside the output directory.
*
* @param pattern - The glob pattern to validate
* @returns true if the pattern is safe, false otherwise
*/
// eslint-disable-next-line class-methods-use-this
private isValidGlobPattern(pattern: string): boolean {
if (pattern.includes('..')) {
return false;
}

const isUnixAbsolutePath = pattern.startsWith('/');
const isWindowsAbsolutePath = /^[a-zA-Z]:[\\/]/.test(pattern);
if (isUnixAbsolutePath || isWindowsAbsolutePath) {
return false;
}

return true;
}

/**
* Normalizes a gitignore-style glob pattern to a valid Wax/Pagefind pattern
* by appending .html extension if not already present.
* Invalid patterns (e.g., path traversal attempts) are logged and return empty string.
*
* @param pattern - The glob pattern from user config (gitignore-style)
* @returns A valid Wax/Pagefind glob pattern, or empty string if invalid
*/
private normalizeGlobPattern(pattern: string): string {
const normalizedPattern = pattern.replace(/\\/g, '/');

if (!this.isValidGlobPattern(pattern)) {
logger.error(`Invalid glob pattern rejected (potential path traversal): ${pattern}`);
return '';
}

if (normalizedPattern.endsWith('.html')) {
return normalizedPattern;
}

if (normalizedPattern.endsWith('/**')) {
return `${normalizedPattern}/*.html`;
}

if (normalizedPattern.endsWith('/*')) {
return `${normalizedPattern}.html`;
}

if (normalizedPattern.endsWith('/')) {
return `${normalizedPattern}**/*.html`;
}

return `${normalizedPattern}/**/*.html`;
}

/**
* Indexes all HTML files in the output directory and logs any errors.
* @param index - The pagefind index instance
* @returns The number of pages indexed
*/
// eslint-disable-next-line class-methods-use-this
private async indexAllHtmlFiles(index: any): Promise<number> {
const result = await index.addDirectory({ path: this.outputPath });
result.errors.forEach((error: string) => logger.error(error));
return result.page_count;
}

/**
* Indexes all the pages of the site using pagefind.
* @returns true if indexing succeeded and pagefind assets were written, false otherwise.
@@ -963,40 +895,43 @@ export class SiteGenerationManager {

const { index } = await createIndex(createIndexOptions);
if (index) {
// Handle glob patterns - support both single string and array of strings
const globValue = pagefindConfig.glob;
const value = globValue ?? [];
const globPatterns = Array.isArray(value) ? value : [value];
// Filter pages that should be indexed (pageConfig.searchable is truthy)
const searchablePages = this.sitePages.pages.filter(
page => page.pageConfig.searchable,
);

let totalPageCount = 0;

if (globPatterns.length > 0) {
const normalizedPatterns = globPatterns
.map(pattern => this.normalizeGlobPattern(pattern))
.filter(pattern => pattern !== '');

if (normalizedPatterns.length > 0) {
const results = await Promise.all(
normalizedPatterns.map(async (normalizedPattern) => {
logger.info(`Pagefind indexing with glob: ${normalizedPattern}`);
const result = await index.addDirectory({
path: this.outputPath,
glob: normalizedPattern,
});

result.errors.forEach((error: string) => logger.error(error));

return result.page_count;
}),
);

totalPageCount += results.reduce((acc, count) => acc + count, 0);
} else {
logger.warn('All glob patterns were invalid, falling back to indexing all HTML files');
totalPageCount = await this.indexAllHtmlFiles(index);
}
if (searchablePages.length === 0) {
logger.info('No pages configured for search indexing');
} else {
totalPageCount = await this.indexAllHtmlFiles(index);
// Add each searchable page to the index using addHTMLFile
const indexingResults = await Promise.all(
searchablePages.map(async (page) => {
try {
const content = await fs.readFile(page.pageConfig.resultPath, 'utf8');
const relativePath = path.relative(this.outputPath, page.pageConfig.resultPath);

// Await inside the try block so a rejected promise is handled by the catch below
return await index.addHTMLFile({
sourcePath: relativePath,
content,
});
} catch (error) {
logger.error(`Failed to index ${page.pageConfig.resultPath}: ${error}`);
return null;
}
}),
);

// Count successful indexings
totalPageCount = indexingResults.filter(r => r !== null).length;

// Log any errors from indexing results
indexingResults.forEach((result) => {
if (result && result.errors) {
result.errors.forEach((error: string) => logger.error(error));
}
});
}

const endTime = new Date();
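
For readers following the new indexing path, here is a stand-alone sketch of the same pattern: filter to searchable pages, hand each one to the index, and count the successes. `HtmlIndex` is a minimal stand-in that mirrors only the `addHTMLFile({ sourcePath, content })` call shape used in the diff, and `BuiltPage`, `indexSearchablePages`, and the in-memory `content` field are hypothetical simplifications (the code in the diff reads each file from `resultPath` instead).

```typescript
// Hypothetical, simplified view of a generated page.
type BuiltPage = { relativePath: string; content: string; searchable: boolean };

// Minimal stand-in for the index object; only the call shape from the diff is modelled.
interface HtmlIndex {
  addHTMLFile(file: { sourcePath: string; content: string }): Promise<{ errors: string[] }>;
}

// Adds every searchable page to the index and returns how many calls succeeded.
async function indexSearchablePages(index: HtmlIndex, pages: BuiltPage[]): Promise<number> {
  const searchablePages = pages.filter(page => page.searchable);

  const results = await Promise.all(
    searchablePages.map(async (page) => {
      try {
        // `await` here so a rejection is caught per page instead of failing Promise.all.
        return await index.addHTMLFile({
          sourcePath: page.relativePath,
          content: page.content,
        });
      } catch (error) {
        console.error(`Failed to index ${page.relativePath}: ${error}`);
        return null;
      }
    }),
  );

  // Surface any per-file errors reported by the index.
  results.forEach(result => result?.errors.forEach(message => console.error(message)));

  // Count only the pages whose call did not throw.
  return results.filter(result => result !== null).length;
}
```

Note the `return await` inside `try`: without the `await`, a rejected promise would bypass the `catch` block and reject the surrounding `Promise.all` instead.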