Skip to content

spider-rs/spider_transformations

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

spider_transformations

A high-performance transformation library for Rust, used by Spider Cloud for AI-powered content cleaning across multiple locales.

This project depends on the spider crate.

Usage

[dependencies]
spider_transformations = "2"
use spider_transformations::transformation::content;

fn main() {
    // page comes from the spider object when streaming.
    let mut conf = content::TransformConfig::default();
    conf.return_format = content::ReturnFormat::Markdown;
    let content = content::transform_content(&page, &conf, &None, &None, &None);
}

Transform types

  1. Markdown
  2. Commonmark
  3. Text
  4. Markdown (Text Map) or HTML2Text
  5. HTML2XML

Document formats (feature: document)

Convert Office documents directly to markdown with zero panics and zero locks:

  1. Excel (.xlsx)
  2. Word (.docx)
  3. PowerPoint (.pptx)

Enable with:

[dependencies]
spider_transformations = { version = "2", features = ["document"] }

Document conversion is automatic — binary files matching Office formats are detected and converted to markdown tables and text. No configuration needed beyond enabling the feature.

Enhancements

  1. Readability
  2. Encoding

Chunking

There are several chunking utils in the transformation mod.

This project has rewrites and forks of html2md, and html2text for performance and bug fixes.

License

MIT

Packages

 
 
 

Contributors

Languages