Releases: SichangHe/DeGenTWeb_docs
Data from Mar 23 Bing Search Results Page Feature Analysis
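The percentage columns in the page-level table below appear to be derived directly from the counts: `pct_ai_pages` is `ai_true_pages / n_ai_pages` expressed as a percentage, and likewise for the human side. A minimal Python sketch, assuming that reading of the column names (the helper `pct_columns` is hypothetical and not taken from the repository), recomputes the derived columns for the `has_meta_generator` row:

```python
def pct_columns(n_ai, n_human, ai_true, human_true):
    """Recompute the four derived percentage columns from raw page counts."""
    pct_ai = 100.0 * ai_true / n_ai
    pct_human = 100.0 * human_true / n_human
    return {
        "pct_ai_pages": pct_ai,
        "pct_human_pages": pct_human,
        "pct_ai_minus_human": pct_ai - pct_human,
        "pct_ai_over_human": pct_ai / pct_human,
    }


# Counts copied from the has_meta_generator row of the table.
row = pct_columns(n_ai=39513, n_human=27668, ai_true=26453, human_true=16388)
for name, value in row.items():
    print(f"{name}: {value:.6f}")
```

Under this assumption, the printed values should agree with the table's `has_meta_generator` row up to rounding in the last digit.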
feature label n_ai_pages n_human_pages ai_true_pages human_true_pages pct_ai_pages pct_human_pages pct_ai_minus_human pct_ai_over_human
5 has_meta_generator meta generator 39513.0 27668.0 26453.0 16388.0 66.947587 59.230880 7.716706 1.130282
0 html_is_minified HTML is minified 39513.0 27668.0 10775.0 6223.0 27.269506 22.491687 4.777819 1.212426
1 has_og OG 39513.0 27668.0 36594.0 24854.0 92.612558 89.829406 2.783152 1.030983
4 has_canonical canonical 39513.0 27668.0 37783.0 26008.0 95.621694 94.000289 1.621405 1.017249
3 has_review_intent_heading review intent heading 39513.0 27668.0 10390.0 6977.0 26.295143 25.216857 1.078286 1.042761
8 has_schema_org_faq schema.org FAQ 39513.0 27668.0 3279.0 2024.0 8.298535 7.315310 0.983225 1.134406
12 has_contact_link contact link 39513.0 27668.0 30706.0 21645.0 77.711133 78.231170 -0.520037 0.993353
7 has_schema_org_product schema.org product 39513.0 27668.0 1371.0 1110.0 3.469744 4.011855 -0.542111 0.864873
10 has_privacy_link privacy link 39513.0 27668.0 30919.0 22074.0 78.250196 79.781697 -1.531501 0.980804
9 has_schema_org_breadcrumb schema.org breadcrumb 39513.0 27668.0 23393.0 16891.0 59.203300 61.048865 -1.845565 0.969769
11 has_terms_link terms link 39513.0 27668.0 21847.0 16133.0 55.290664 58.309238 -3.018574 0.948232
6 has_schema_org_article schema.org article 39513.0 27668.0 15893.0 12548.0 40.222205 45.352031 -5.129826 0.886889
2 has_bread bread 39513.0 27668.0 12321.0 10498.0 31.182143 37.942750 -6.760607 0.821821

feature label n_ai n_human median_ai median_human median_ai_minus_human median_ai_over_human mean_ai mean_human mean_ai_minus_human mean_ai_over_human ks_stat ks_pvalue ks_qvalue mw_stat mw_pvalue mw_qvalue
p50_n_outgoing_domains Per-site median selected-page number of outgoing domains 2651.0 1855.0 5.000000 8.000000 -3.000000 0.625000 6.177480 9.239084 -3.061603 0.668625 0.250390 1.396060e-60 1.074966e-58 1604730.5 2.399777e-88 1.847828e-86
p50_self_link_ratio Per-site median selected-page self link ratio 2651.0 1855.0 0.902655 0.850746 0.051909 1.061015 0.867189 0.816066 0.051123 1.062646 0.181059 9.821776e-32 3.781384e-30 3030167.0 2.333718e-40 8.984814e-39
p50_url_hyphen_path_ratio Per-site median selected-page URL hyphen path ratio 2651.0 1855.0 0.122222 0.109375 0.012847 1.117460 0.118520 0.105595 0.012924 1.122396 0.179360 3.778037e-31 9.696962e-30 3026464.0 7.641418e-40 1.961297e-38
p50_h1_text_num_chars Per-site median selected-page H1 text num chars 2651.0 1855.0 50.000000 44.000000 6.000000 1.136364 48.355148 42.438545 5.916603 1.139416 0.171154 2.169576e-28 4.176433e-27 3003572.5 7.745509e-37 1.491010e-35
p50_num_h2 Per-site median selected-page number of H2 2651.0 1855.0 8.000000 6.000000 2.000000 1.333333 8.832893 7.115364 1.717529 1.241383 0.167966 2.359694e-27 3.633930e-26 2968161.5 1.557613e-32 2.398724e-31
p50_n_outgoing_links Per-site median selected-page number of outgoing links 2651.0 1855.0 64.000000 86.000000 -22.000000 0.744186 90.664466 115.020755 -24.356288 0.788244 0.164703 2.578326e-26 3.308852e-25 1960206.5 4.031372e-31 5.173594e-30
p50_anchor_chars_mean Per-site median selected-page anchor chars mean 2651.0 1855.0 17.793104 15.867256 1.925848 1.121372 22.562378 19.151495 3.410883 1.178100 0.156434 8.999445e-24 8.661966e-23 2902631.0 5.305948e-25 3.714164e-24
p50_anchor_text_num_chars Per-site median selected-page anchor text num chars 2651.0 1855.0 17.793104 15.867256 1.925848 1.121372 22.562378 19.151495 3.410883 1.178100 0.156434 8.999445e-24 8.661966e-23 2902631.0 5.305948e-25 3.714164e-24
p50_anchor_text_num_words Per-site median selected-page anchor text num words 2651.0 1855.0 2.803571 2.538461 0.265110 1.104437 3.414127 3.016298 0.397829 1.131893 0.154936 2.509941e-23 1.932655e-22 2869415.5 1.243322e-21 6.838273e-21
p50_anchor_words_mean Per-site median selected-page anchor words mean 2651.0 1855.0 2.803571 2.538461 0.265110 1.104437 3.414127 3.016298 0.397829 1.131893 0.154936 2.509941e-23 1.932655e-22 2869415.5 1.243322e-21 6.838273e-21
p50_external_link_ratio Per-site median selected-page external link ratio 2651.0 1855.0 0.086957 0.125000 -0.038043 0.695652 0.119899 0.152393 -0.032494 0.786777 0.145978 9.511947e-21 6.658363e-20 1988650.0 7.034704e-28 7.738174e-27
p50_image_missing_alt_ratio Per-site median selected-page image missing alt ratio 2651.0 1855.0 0.000000 0.000000 0.000000 NaN 0.043538 0.073167 -0.029629 0.595054 0.143914 3.547109e-20 2.276062e-19 2111991.5 6.544427e-24 4.199341e-23
p50_n_images_missing_alt Per-site median selected-page number of images missing alt 2651.0 1855.0 0.000000 0.000000 0.000000 NaN 1.177480 2.043127 -0.865646 0.576313 0.142944 6.542483e-20 3.875163e-19 2093079.5 1.805274e-26 1.737576e-25
p50_meta_description_num_chars Per-site median selected-page meta description num chars 2651.0 1855.0 146.000000 139.000000 7.000000 1.050360 145.548096 130.043396 15.504700 1.119227 0.136528 3.378708e-18 1.858289e-17 2771282.5 3.215750e-13 1.303225e-12
p50_h1_text_num_words Per-site median selected-page H1 text num words 2651.0 1855.0 8.000000 7.000000 1.000000 1.142857 7.860619 7.016981 0.843637 1.120228 0.135237 7.311538e-18 3.753256e-17 2899552.0 4.907212e-25 3.714164e-24
p50_n_images Per-site median selected-page number of images 2651.0 1855.0 13.000000 18.000000 -5.000000 0.722222 22.166352 27.517251 -5.350898 0.805544 0.123911 4.632146e-15 2.229220e-14 2058918.0 1.304041e-20 6.694075e-20
p50_meta_description_function_word_ratio Per-site median selected-page meta description function word ratio 2651.0 1855.0 0.357143 0.384615 -0.027473 0.928571 0.317524 0.335206 -0.017682 0.947249 0.123380 6.181277e-15 2.799755e-14 2168528.0 1.300097e-11 4.767022e-11
p50_self_links Per-site median selected-page self links 2651.0 1855.0 54.000000 68.000000 -14.000000 0.794118 79.822520 96.002426 -16.179906 0.831464 0.119574 4.711568e-14 2.015504e-13 2116688.0 1.709766e-15 8.228251e-15
p50_anchor_text_function_word_ratio Per-site median selected-page anchor text function word ratio 2651.0 1855.0 0.115487 0.103105 0.012382 1.120096 0.125076 0.111486 0.013591 1.121904 0.109921 6.105293e-12 2.474251e-11 2793679.0 6.590877e-15 2.819431e-14
p50_n_forms Per-site median selected-page number of forms 2651.0 1855.0 2.000000 2.000000 0.000000 1.000000 1.875896 2.249865 -0.373969 0.833781 0.105377 5.224777e-11 2.011539e-10 2215686.5 7.485116e-09 2.305416e-08
p50_image_alt_coverage Per-site median selected-page image alt coverage 2651.0 1855.0 0.830769 0.764045 0.066724 1.087330 0.710092 0.683082 0.027010 1.039542 0.102475 1.963149e-10 6.871021e-10 2680866.5 1.960546e-07 5.205586e-07
p50_image_non_empty_alt_ratio Per-site median selected-page image non empty alt ratio 2651...
Data from Mar 23 Common Crawl page feature analysis
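The `ks_qvalue` and `mw_qvalue` columns in the per-site tables of these feature-analysis releases are consistent with a Benjamini-Hochberg correction over 77 tests: the smallest KS p-value, 1.396060e-60, multiplied by 77 gives the listed q-value 1.074966e-58, and the second smallest multiplied by 77/2 gives 3.781384e-30. The releases do not name the correction method, so the Python sketch below is an assumption based on that arithmetic. `bh_qvalues` is a hypothetical helper, and the total test count `m` is passed explicitly because the tables shown here are truncated.

```python
def bh_qvalues(pvalues, m=None):
    """Benjamini-Hochberg adjusted p-values (q-values).

    m is the total number of tests; it defaults to len(pvalues).
    """
    m = len(pvalues) if m is None else m
    # Indices of the p-values in ascending order.
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    qvalues = [0.0] * len(pvalues)
    running_min = 1.0
    # Walk from the largest p-value down so that each q-value is capped
    # by the q-values of all larger p-values (keeps them monotone).
    for rank in range(len(order), 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# The three smallest ks_pvalue entries from the per-site table.
ks_pvalues = [1.396060e-60, 9.821776e-32, 3.778037e-31]
q = bh_qvalues(ks_pvalues, m=77)
for p, qv in zip(ks_pvalues, q):
    print(f"p={p:.6e} -> q={qv:.6e}")
```

Passing only the smallest p-values keeps their ranks correct, but the running minimum then ignores the omitted larger p-values, so this is a check on the listed numbers rather than a full reproduction of the analysis.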
feature label n_ai_pages n_human_pages ai_true_pages human_true_pages pct_ai_pages pct_human_pages pct_ai_minus_human pct_ai_over_human
5 has_meta_generator meta generator 39513.0 27668.0 26453.0 16388.0 66.947587 59.230880 7.716706 1.130282
0 html_is_minified HTML is minified 39513.0 27668.0 10775.0 6223.0 27.269506 22.491687 4.777819 1.212426
1 has_og OG 39513.0 27668.0 36594.0 24854.0 92.612558 89.829406 2.783152 1.030983
4 has_canonical canonical 39513.0 27668.0 37783.0 26008.0 95.621694 94.000289 1.621405 1.017249
3 has_review_intent_heading review intent heading 39513.0 27668.0 10390.0 6977.0 26.295143 25.216857 1.078286 1.042761
8 has_schema_org_faq schema.org FAQ 39513.0 27668.0 3279.0 2024.0 8.298535 7.315310 0.983225 1.134406
12 has_contact_link contact link 39513.0 27668.0 30706.0 21645.0 77.711133 78.231170 -0.520037 0.993353
7 has_schema_org_product schema.org product 39513.0 27668.0 1371.0 1110.0 3.469744 4.011855 -0.542111 0.864873
10 has_privacy_link privacy link 39513.0 27668.0 30919.0 22074.0 78.250196 79.781697 -1.531501 0.980804
9 has_schema_org_breadcrumb schema.org breadcrumb 39513.0 27668.0 23393.0 16891.0 59.203300 61.048865 -1.845565 0.969769
11 has_terms_link terms link 39513.0 27668.0 21847.0 16133.0 55.290664 58.309238 -3.018574 0.948232
6 has_schema_org_article schema.org article 39513.0 27668.0 15893.0 12548.0 40.222205 45.352031 -5.129826 0.886889
2 has_bread bread 39513.0 27668.0 12321.0 10498.0 31.182143 37.942750 -6.760607 0.821821

feature label n_ai n_human median_ai median_human median_ai_minus_human median_ai_over_human mean_ai mean_human mean_ai_minus_human mean_ai_over_human ks_stat ks_pvalue ks_qvalue mw_stat mw_pvalue mw_qvalue
p50_n_outgoing_domains Per-site median selected-page number of outgoing domains 2651.0 1855.0 5.000000 8.000000 -3.000000 0.625000 6.177480 9.239084 -3.061603 0.668625 0.250390 1.396060e-60 1.074966e-58 1604730.5 2.399777e-88 1.847828e-86
p50_self_link_ratio Per-site median selected-page self link ratio 2651.0 1855.0 0.902655 0.850746 0.051909 1.061015 0.867189 0.816066 0.051123 1.062646 0.181059 9.821776e-32 3.781384e-30 3030167.0 2.333718e-40 8.984814e-39
p50_url_hyphen_path_ratio Per-site median selected-page URL hyphen path ratio 2651.0 1855.0 0.122222 0.109375 0.012847 1.117460 0.118520 0.105595 0.012924 1.122396 0.179360 3.778037e-31 9.696962e-30 3026464.0 7.641418e-40 1.961297e-38
p50_h1_text_num_chars Per-site median selected-page H1 text num chars 2651.0 1855.0 50.000000 44.000000 6.000000 1.136364 48.355148 42.438545 5.916603 1.139416 0.171154 2.169576e-28 4.176433e-27 3003572.5 7.745509e-37 1.491010e-35
p50_num_h2 Per-site median selected-page number of H2 2651.0 1855.0 8.000000 6.000000 2.000000 1.333333 8.832893 7.115364 1.717529 1.241383 0.167966 2.359694e-27 3.633930e-26 2968161.5 1.557613e-32 2.398724e-31
p50_n_outgoing_links Per-site median selected-page number of outgoing links 2651.0 1855.0 64.000000 86.000000 -22.000000 0.744186 90.664466 115.020755 -24.356288 0.788244 0.164703 2.578326e-26 3.308852e-25 1960206.5 4.031372e-31 5.173594e-30
p50_anchor_chars_mean Per-site median selected-page anchor chars mean 2651.0 1855.0 17.793104 15.867256 1.925848 1.121372 22.562378 19.151495 3.410883 1.178100 0.156434 8.999445e-24 8.661966e-23 2902631.0 5.305948e-25 3.714164e-24
p50_anchor_text_num_chars Per-site median selected-page anchor text num chars 2651.0 1855.0 17.793104 15.867256 1.925848 1.121372 22.562378 19.151495 3.410883 1.178100 0.156434 8.999445e-24 8.661966e-23 2902631.0 5.305948e-25 3.714164e-24
p50_anchor_text_num_words Per-site median selected-page anchor text num words 2651.0 1855.0 2.803571 2.538461 0.265110 1.104437 3.414127 3.016298 0.397829 1.131893 0.154936 2.509941e-23 1.932655e-22 2869415.5 1.243322e-21 6.838273e-21
p50_anchor_words_mean Per-site median selected-page anchor words mean 2651.0 1855.0 2.803571 2.538461 0.265110 1.104437 3.414127 3.016298 0.397829 1.131893 0.154936 2.509941e-23 1.932655e-22 2869415.5 1.243322e-21 6.838273e-21
p50_external_link_ratio Per-site median selected-page external link ratio 2651.0 1855.0 0.086957 0.125000 -0.038043 0.695652 0.119899 0.152393 -0.032494 0.786777 0.145978 9.511947e-21 6.658363e-20 1988650.0 7.034704e-28 7.738174e-27
p50_image_missing_alt_ratio Per-site median selected-page image missing alt ratio 2651.0 1855.0 0.000000 0.000000 0.000000 NaN 0.043538 0.073167 -0.029629 0.595054 0.143914 3.547109e-20 2.276062e-19 2111991.5 6.544427e-24 4.199341e-23
p50_n_images_missing_alt Per-site median selected-page number of images missing alt 2651.0 1855.0 0.000000 0.000000 0.000000 NaN 1.177480 2.043127 -0.865646 0.576313 0.142944 6.542483e-20 3.875163e-19 2093079.5 1.805274e-26 1.737576e-25
p50_meta_description_num_chars Per-site median selected-page meta description num chars 2651.0 1855.0 146.000000 139.000000 7.000000 1.050360 145.548096 130.043396 15.504700 1.119227 0.136528 3.378708e-18 1.858289e-17 2771282.5 3.215750e-13 1.303225e-12
p50_h1_text_num_words Per-site median selected-page H1 text num words 2651.0 1855.0 8.000000 7.000000 1.000000 1.142857 7.860619 7.016981 0.843637 1.120228 0.135237 7.311538e-18 3.753256e-17 2899552.0 4.907212e-25 3.714164e-24
p50_n_images Per-site median selected-page number of images 2651.0 1855.0 13.000000 18.000000 -5.000000 0.722222 22.166352 27.517251 -5.350898 0.805544 0.123911 4.632146e-15 2.229220e-14 2058918.0 1.304041e-20 6.694075e-20
p50_meta_description_function_word_ratio Per-site median selected-page meta description function word ratio 2651.0 1855.0 0.357143 0.384615 -0.027473 0.928571 0.317524 0.335206 -0.017682 0.947249 0.123380 6.181277e-15 2.799755e-14 2168528.0 1.300097e-11 4.767022e-11
p50_self_links Per-site median selected-page self links 2651.0 1855.0 54.000000 68.000000 -14.000000 0.794118 79.822520 96.002426 -16.179906 0.831464 0.119574 4.711568e-14 2.015504e-13 2116688.0 1.709766e-15 8.228251e-15
p50_anchor_text_function_word_ratio Per-site median selected-page anchor text function word ratio 2651.0 1855.0 0.115487 0.103105 0.012382 1.120096 0.125076 0.111486 0.013591 1.121904 0.109921 6.105293e-12 2.474251e-11 2793679.0 6.590877e-15 2.819431e-14
p50_n_forms Per-site median selected-page number of forms 2651.0 1855.0 2.000000 2.000000 0.000000 1.000000 1.875896 2.249865 -0.373969 0.833781 0.105377 5.224777e-11 2.011539e-10 2215686.5 7.485116e-09 2.305416e-08
p50_image_alt_coverage Per-site median selected-page image alt coverage 2651.0 1855.0 0.830769 0.764045 0.066724 1.087330 0.710092 0.683082 0.027010 1.039542 0.102475 1.963149e-10 6.871021e-10 2680866.5 1.960546e-07 5.205586e-07
p50_image_non_empty_alt_ratio Per-site median selected-page image non empty alt ratio 2651.0...
Zstandard Dictionary for AWS Common Crawl S3 Request Service
zstd-dict-aws-cc-s3 docs(CC): S3 note
Data for e32b054 & 9276c06
Data for 5c55ee0
2K URLs
Camera Ready Version for IMC 2025 Poster & Student Workshop Presentation Slides
- degentweb_imc2025poster202509301503.pdf is the 2-page paper.
- Poster_DeGenTWeb_IMC2025_202510241725.pdf is the A0-sized poster.
- Poster_DeGenTWeb_IMC2025_202510241725.pptx is the poster in PowerPoint on macOS.
- Presentation slides for IMC 2025 Student Workshop in Google Slides, or in PowerPoint (IMC2025SW.Did.I.Just.Browser.A.Website.Written.By.LLMs._.Sichang.He.pptx below).
Preprint: Submission for IMC 2025 Student Workshop
preprint-imc2025-sw chore: reduce irrelevant info for prompts
Assets for IMC 2025 Student Workshop
Used in src/degentweb/classifying/imc2025sw.py.
degentweb_pipeline.pdf
cdf_baseline_svm_scores.pdf
cdf_search_subdomain_scores.pdf
degentweb_pipeline_ni.pdf
Data for #29 in main repo, CDFs of various SVMs
cdf_svm_personal_binoculars.pdf
cdf_svm_personal_fast_npr.pdf
cdf_svm_personal_fast_detect_gpt.pdf
cdf_svm_personal_entropy.pdf
cdf_svm_personal_lrr.pdf
cdf_svm_personal_log_rank.pdf
cdf_svm_personal_log_p.pdf
cdf_svm_company_binoculars.pdf
cdf_svm_company_fast_npr.pdf
cdf_svm_company_fast_detect_gpt.pdf
cdf_svm_company_lrr.pdf
cdf_svm_company_entropy.pdf
cdf_svm_company_log_rank.pdf
cdf_svm_company_log_p.pdf