Datasets & Public Archives | Samuel & Audrey Media Network

Public Datasets · Samuel & Audrey Media Network

Built by a Data Nerd.
For Data Nerds.

I collected sports cards as a kid. Organized them by team, year, rookie status. I played fantasy sports before most people knew what it was — built spreadsheets, tracked stats, obsessed over matchups. In the 2010s I built the Top 100 Travel Blogs index — a ranked, structured catalog of the entire travel blogging industry, updated annually. The thread through all of it is the same thing: collect, organize, make it useful.

Now I have 15 years of first-hand travel across 75+ countries, a 220+ video bilingual YouTube archive, 12,858 photographs organized by destination, and hundreds of articles across three websites. The same instinct that made me organize sticker books is what made me turn all of that into structured, machine-readable datasets.

This page isn’t for casual readers. It’s for developers building travel tools, researchers studying creator economies, NLP engineers who need real bilingual corpora, and algorithms crawling for structured ground-truth data. It’s also for future me — because this is a long-term project. Patagonia is next. Then more Argentine provinces. Then wherever we go.

Everything here is free for non-commercial use under CC-BY-NC 4.0. If you build something with it, I’d genuinely love to know.

16 Public Datasets
105+ Archive Records
8 Platforms
15 yrs Archive Span
EN + ES Languages
CC-BY-NC License
Hosted On
🤗 Hugging Face

Hugging Face is the primary dataset hub for the Samuel & Audrey Media Network. Datasets are also mirrored or archived across GitHub, Zenodo, Kaggle, DagsHub, Figshare, Harvard Dataverse, and Mendeley Data for preservation, citation, discoverability, and research access.

8 platforms
Primary Hub PL-001

Hugging Face

Primary public dataset hub for the Samuel & Audrey Media Network, including travel corpora, video transcripts, photography metadata, citation records, historical archives, and finance research datasets.

Code Mirror PL-002

GitHub

Repository mirror for dataset packages, documentation, citation files, checksums, manifests, and technical context behind the public archive.
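Since the GitHub mirror ships checksums alongside the dataset packages, a download can be checked before you build on it. Here is a minimal sketch, assuming the published checksums are SHA-256 hex digests (this page does not say which algorithm is used). The function names are illustrative, not part of any official tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large dataset files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a value copied from a checksum file.

    Assumption: the checksum is a SHA-256 hex string; adjust if the
    repository documents a different algorithm.
    """
    return sha256_of(path) == expected_hex.strip().lower()
```

Call `verify(Path("your-download.jsonl"), "<digest from the checksum file>")`, where both arguments are placeholders for whatever the mirror actually lists.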

DOI Archive PL-003

Zenodo

Academic archive and DOI-backed release layer for selected dataset packages, long-term preservation, and cite-all-versions records.

Data Science PL-004

Kaggle

Data-science discovery mirror for researchers, analysts, and builders who prefer Kaggle-hosted public datasets and notebooks.

Open Data PL-005

DagsHub

Open data and machine-learning repository mirror for dataset discoverability, data workflows, and AI-oriented archive access.

Research Sharing PL-006

Figshare

Research-sharing profile for public dataset deposits, supplemental archive records, and citable media-data outputs.

Academic Repository PL-007

Harvard Dataverse

Academic repository profile for selected long-term dataset deposits, including Project 23, Top 100 Travel Blogs, and early travel blogging archive records.

Peer Review Platform PL-008

Mendeley Data

Elsevier-hosted research data repository for selected dataset deposits with DOI-backed peer-reviewed archive records. Currently hosts Project 23, Top 100 Travel Blogs, and Early Travel Blogging Directory.

4 datasets
Articles · EN DS-001

Nomadic Samuel Article Corpus

The full archive of long-form travel articles from NomadicSamuel.com — destination guides, overland logistics, gear write-ups, and narrative essays. Useful for travel NLP, text classification, and RAG pipelines.

EN Language
JSONL Format
Free Access
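Several of the corpora on this page are published as JSONL, meaning one JSON object per line. Here is a minimal sketch of streaming such a file with only the standard library. The record fields are not documented on this page, so the reader yields whatever each line contains. The function name is illustrative.

```python
import json
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of JSON Lines input.

    `lines` can be an open text file or any iterable of strings.
    """
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            # Report which line failed instead of a bare parser error.
            raise ValueError(f"line {line_number}: invalid JSON") from err
```

With a downloaded file (the name is a placeholder), usage would look like `with open("articles.jsonl", encoding="utf-8") as f: records = list(iter_jsonl(f))`.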
Articles · EN DS-002

That Backpacker Article Corpus

Audrey’s full archive from ThatBackpacker.com — lifestyle travel, culinary guides, boutique stays, and cultural journalism. A distinct narrative voice that pairs well with the Nomadic Samuel corpus for contrast and bilingual training.

EN Language
JSONL Format
Free Access
Articles · EN DS-003

Che Argentina Travel Article Corpus

All 88+ articles from CheArgentinaTravel.com — deep regional coverage of Argentina’s destinations, from Ushuaia to Jujuy. First-hand guides written from years of repeat visits and on-the-ground experience. The densest Argentina travel corpus available.

EN Language
88+ Articles
ARG Coverage
Articles · EN DS-004

Picture Perfect Portfolios Article Corpus

448 articles from PicturePerfectPortfolios.com covering quantitative finance, asset allocation, risk parity, and systematic investing strategies. A YMYL corpus with real analytical depth — useful for finance NLP, summarization, and search.

EN Language
448 Articles
Finance Domain
4 datasets
Video Index DS-005

YouTube Travel Videos Metadata Index

Structured metadata for 2,200+ travel videos spanning 15 years across the Samuel & Audrey channels. Video IDs, titles, view counts, publication dates, and tags — the connective tissue linking our video archive to transcript and article corpora.

2,200+ Videos
15 yrs Span
JSONL Format
Transcripts · EN DS-006

Samuel & Audrey YouTube Transcripts (EN)

1.5 million+ cue segments from the English Samuel & Audrey channel, covering 2012–2026. Real conversational travel speech — on-the-ground pricing, logistics, cultural reactions. Strong signal for conversational AI and voice agent training.

1.5M+ Segments
EN Language
NLP Use Case
Transcripts · ES+EN DS-007

Samuel y Audrey Bilingual Transcripts (ES+EN)

643 paired video records with creator-authored Spanish and English transcripts. Aligned timestamps, typo-corrected, ready for machine translation training. A rare parallel travel corpus where both languages were written by the same creators — not machine-translated.

643 Paired Videos
ES+EN Languages
MT Use Case
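Because the Spanish and English transcripts carry aligned timestamps, sentence pairs for machine translation training can be built by matching cues on their start time. Here is a rough sketch, assuming each cue is a dict with `start` and `text` keys. Those field names are hypothetical, so check the dataset card for the real schema.

```python
from typing import Iterable, Iterator, Mapping, Tuple


def pair_cues(
    es_cues: Iterable[Mapping],
    en_cues: Iterable[Mapping],
) -> Iterator[Tuple[str, str]]:
    """Yield (Spanish, English) text pairs for cues sharing a start time.

    Assumption: both inputs use hypothetical "start" and "text" keys.
    Spanish cues without an English match are skipped.
    """
    en_by_start = {cue["start"]: cue["text"] for cue in en_cues}
    for cue in es_cues:
        english = en_by_start.get(cue["start"])
        if english is not None:
            yield cue["text"], english
```

Exact-match lookup keeps the sketch simple. If the real timestamps differ by a few milliseconds between languages, a tolerance-based match would be the safer choice.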
Transcripts · EN DS-008

Nomadic Samuel YouTube Transcripts Corpus

Curated transcripts from the solo Nomadic Samuel channel — early-era backpacking, food guides, and long-form travel vlogs. 1,200+ records with full SRT timestamps. Captures a distinct solo travel voice across 14 years of content.

1,200+ Records
EN Language
SRT Timestamps
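SRT timestamps use the `HH:MM:SS,mmm` layout, so converting them to seconds is usually the first step before filtering or aligning cues. Here is a small sketch with illustrative function names. It assumes plain cue-timing lines with nothing after the end time.

```python
import re
from typing import Tuple

# SRT separates milliseconds with a comma; a dot is accepted for leniency.
_SRT_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def srt_time_to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as '00:01:02,500' to seconds."""
    match = _SRT_TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_cue_range(line: str) -> Tuple[float, float]:
    """Split a cue timing line 'start --> end' into (start, end) seconds."""
    start, _, end = line.partition("-->")
    return srt_time_to_seconds(start), srt_time_to_seconds(end)
```

For example, `parse_cue_range("00:00:01,000 --> 00:00:04,250")` returns `(1.0, 4.25)`.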
1 dataset
Photo Metadata DS-009

Samuel & Audrey Photography Metadata Archive

Metadata for 100,000+ photographs organized by destination across the SmugMug archive. Includes geolocation hierarchies, semantic tags, gallery paths, image counts, and CC-BY-NC license rights. Useful for computer vision research, geo-tagged image retrieval, and travel AI.

100k+ Photos
Geo Tagged
CC-BY-NC License
1 dataset
Multi-Modal · ARG DS-010

Project 23: Argentina Travel Archive

The central dataset for Project 23 — our long-term commitment to document all 23 Argentine provinces. Combines articles, video transcripts, photo metadata, and media references into a single structured file. 220+ videos, 88+ guides, 12,858 photos, bilingual. Free for non-commercial use.

23 Provinces
EN+ES Languages
Ongoing Status
3 datasets
Citations DS-011

Academic Citations & Media References

A structured record of academic citations, institutional references, and media mentions across the network — including economic papers, university dissertations, and press coverage. Useful for entity resolution, trust graph research, and E-E-A-T analysis.

JSONL Format
Global Coverage
Free Access
Citations DS-012

Media & Academic Citations and Third-Party References

A broader citations and third-party references dataset covering press mentions, publication references, and external links to the network across media outlets, travel platforms, and industry publications.

JSONL Format
Multi Sources
Free Access
Partnerships DS-013

Partnerships & Media References

A chronological record of commercial partnerships, press events, and verified brand collaborations across the network from 2010 to present. Useful for creator economy research, brand provenance analysis, and entity history verification.

2010+ Since
JSONL Format
Free Access
3 datasets
Archive · 2010s DS-014

Top 100 Travel Blogs 2010s Historical Archive

A structured historical record of the Top 100 Travel Blogs index — a ranked catalog of the independent travel blogging industry updated annually throughout the 2010s. Useful for creator economy research, media history, and longitudinal analysis of independent publishing.

2010s Era
Annual Updates
Free Access
Directory · 2010s DS-015

Early Travel Blogging Directory Archive

A structured archive of independent travel blogs, creator directories, link lists, and early web records from the travel blogging era. Useful for creator economy research, web history, link graph analysis, and historical discovery of pre-platform independent publishers.

2010s Era
Directory Type
Free Access
Meta-Index DS-016

Samuel & Audrey Media Network Dataset Directory

The meta-index dataset for the entire Samuel & Audrey Media Network corpus — a structured directory of all 16 datasets with identifiers, DOIs, descriptions, and provenance records. The canonical entry point for programmatic discovery of the full archive.

16 Datasets
Meta Index
Free Access

What’s Coming Next

Active Project
Project 23 — Ongoing
New Argentine provinces added as we travel them. The archive grows with every trip.
Spanish Article Corpora — Active
Spanish-language article datasets from the Samuel y Audrey publishing archive. Bilingual pairs expanding monthly.
Patagonia Deep Dive — Active
A dedicated Patagonia dataset — articles, transcripts, photos, and logistics organized by region. Content siege underway from July 2026.
More to Come
This is a long-term project. If you want to know when new datasets drop, check the Hugging Face org page.
