Datasets & Public Archives | Samuel & Audrey Media Network

Public Datasets · Samuel & Audrey Media Network

Built by a Data Nerd.
For Data Nerds.

I collected sports cards as a kid. Organized them by team, year, rookie status. I played fantasy sports before most people knew what it was — built spreadsheets, tracked stats, obsessed over matchups. In the 2010s I built the Top 100 Travel Blogs index — a ranked, structured catalog of the entire travel blogging industry, updated annually. The thread through all of it is the same thing: collect, organize, make it useful.

Now I have 15 years of first-hand travel across 75+ countries, a 220+ video bilingual YouTube archive, 12,858 photographs organized by destination, and hundreds of articles across three websites. The same instinct that made me organize sticker books is what made me turn all of that into structured, machine-readable datasets.

This page isn’t for casual readers. It’s for developers building travel tools, researchers studying creator economies, NLP engineers who need real bilingual corpora, and algorithms crawling for structured ground-truth data. It’s also for future me — because this is a long-term project. Patagonia is next. Then more Argentine provinces. Then wherever we go.

Everything here is free for non-commercial use under CC-BY-NC 4.0. If you build something with it, I’d genuinely love to know.

13 Public Datasets
15 yrs Archive Span
EN + ES Languages
CC-BY-NC License
JSONL Primary Format
Hosted on 🤗 Hugging Face

All datasets are published on Hugging Face under the samuelandaudreymedianetwork organization. Free to access, download, and use for non-commercial research and development.
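
Since JSONL is the primary format, a downloaded file can be read with nothing beyond the Python standard library. The snippet below is a minimal sketch, not the official loader: the field names `title` and `text` and the sample records are assumptions made up for illustration, so check each dataset card for the real schema.

```python
import json
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of JSONL input."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


# Hypothetical sample records; the real field names may differ.
sample = [
    '{"title": "Example guide", "text": "Body text..."}',
    '',
    '{"title": "Another guide", "text": "More text..."}',
]

records = list(iter_jsonl(sample))
print(len(records), records[0]["title"])

# With a real file the same helper works on an open file handle:
#     with open("articles.jsonl", encoding="utf-8") as f:
#         for record in iter_jsonl(f):
#             ...
```

Reading line by line keeps memory use flat, which matters more for the transcript corpora than for the article sets.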

4 datasets
Articles · EN DS-001

Nomadic Samuel Article Corpus

The full archive of long-form travel articles from NomadicSamuel.com — destination guides, overland logistics, gear write-ups, and narrative essays. Useful for travel NLP, text classification, and RAG pipelines.

EN Language
JSONL Format
Free Access
Articles · EN DS-002

That Backpacker Article Corpus

Audrey’s full archive from ThatBackpacker.com: lifestyle travel, culinary guides, boutique stays, and cultural journalism. A distinct narrative voice that pairs well with the Nomadic Samuel corpus for stylistic contrast.

EN Language
JSONL Format
Free Access
Articles · EN DS-003

Che Argentina Travel Article Corpus

All 88+ articles from CheArgentinaTravel.com — deep regional coverage of Argentina’s destinations, from Ushuaia to Jujuy. First-hand guides written from years of repeat visits and on-the-ground experience. The densest Argentina travel corpus available.

EN Language
88+ Articles
ARG Coverage
Articles · EN DS-004

Picture Perfect Portfolios Article Corpus

448 articles from PicturePerfectPortfolios.com covering quantitative finance, asset allocation, risk parity, and systematic investing strategies. A YMYL (Your Money or Your Life) corpus with real analytical depth, useful for finance NLP, summarization, and search.

EN Language
448 Articles
Finance Domain
4 datasets
Video Index DS-005

YouTube Travel Videos Metadata Index

Structured metadata for 2,200+ travel videos spanning 15 years across the Samuel & Audrey channels. Video IDs, titles, view counts, publication dates, and tags — the connective tissue linking our video archive to transcript and article corpora.

2,200+ Videos
15 yrs Span
JSONL Format
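
If the metadata index is used as the link between corpora, the join might look something like the sketch below. It assumes both the index and the transcript records carry a shared `video_id` key; that key name, along with `title` and `text`, is hypothetical and may not match the published schema.

```python
from collections import defaultdict

# Hypothetical records: "video_id", "title", and "text" are assumed field names.
videos = [
    {"video_id": "abc123", "title": "Example video A"},
    {"video_id": "def456", "title": "Example video B"},
]
cues = [
    {"video_id": "abc123", "text": "first cue"},
    {"video_id": "abc123", "text": "second cue"},
    {"video_id": "zzz999", "text": "cue with no metadata row"},
]


def join_on_video_id(videos, cues):
    """Attach each video's transcript cues to its metadata record."""
    by_video = defaultdict(list)
    for cue in cues:
        by_video[cue["video_id"]].append(cue["text"])
    # Videos without any cues get an empty list rather than being dropped.
    return [
        {**video, "cues": by_video.get(video["video_id"], [])}
        for video in videos
    ]


joined = join_on_video_id(videos, cues)
for row in joined:
    print(row["title"], len(row["cues"]))
```

This is a left join from the index: every video is kept, and cues whose ID is missing from the index are ignored.
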
Transcripts · EN DS-006

Samuel & Audrey YouTube Transcripts (EN)

1.5 million+ cue segments from the English Samuel & Audrey channel, covering 2012–2026. Real conversational travel speech — on-the-ground pricing, logistics, cultural reactions. Strong signal for conversational AI and voice agent training.

1.5M+ Segments
EN Language
NLP Use Case
Transcripts · ES+EN DS-007

Samuel y Audrey Bilingual Transcripts (ES+EN)

643 paired video records with creator-authored Spanish and English transcripts. Aligned timestamps, typo-corrected, ready for machine translation training. A rare parallel travel corpus where both languages were written by the same creators — not machine-translated.

643 Paired Videos
ES+EN Languages
MT Use Case
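
For machine translation work, aligned timestamps make it possible to turn a paired record into sentence pairs. The following is only a sketch under stated assumptions: it supposes each cue has an integer `start_ms` and a `text` field (both hypothetical names) and that aligned cues share an identical start time, which may not hold for every record.

```python
def pair_cues(es_cues, en_cues):
    """Pair Spanish and English cues that share the same start time."""
    en_by_start = {cue["start_ms"]: cue["text"] for cue in en_cues}
    pairs = []
    for cue in es_cues:
        en_text = en_by_start.get(cue["start_ms"])
        if en_text is not None:  # skip Spanish cues with no English match
            pairs.append((cue["text"], en_text))
    return pairs


# Hypothetical cues; "start_ms" and "text" are assumed field names.
es_cues = [
    {"start_ms": 0, "text": "Hola"},
    {"start_ms": 2500, "text": "Vamos"},
    {"start_ms": 5000, "text": "Sin pareja"},
]
en_cues = [
    {"start_ms": 0, "text": "Hello"},
    {"start_ms": 2500, "text": "Let's go"},
]

pairs = pair_cues(es_cues, en_cues)
print(pairs)
```

Exact matching on start time is the simplest possible rule. If the two tracks drift, matching on overlapping time windows would be a more forgiving choice.
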
Transcripts · EN DS-008

Nomadic Samuel YouTube Transcripts Corpus

Curated transcripts from the solo Nomadic Samuel channel — early-era backpacking, food guides, and long-form travel vlogs. 1,200+ records with full SRT timestamps. Captures a distinct solo travel voice across 14 years of content.

1,200+ Records
EN Language
SRT Timestamps
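
SRT timing lines follow the standard `HH:MM:SS,mmm --> HH:MM:SS,mmm` pattern, so converting them to seconds takes only a few lines. This sketch assumes the records store timestamps in that raw string form, which is worth confirming against the dataset card before relying on it.

```python
import re

# Standard SRT timestamp: hours, minutes, seconds, then comma and milliseconds.
SRT_TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def srt_to_seconds(stamp: str) -> float:
    """Convert one SRT timestamp such as '00:01:02,500' to seconds."""
    match = SRT_TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_timing_line(line: str) -> tuple:
    """Split an SRT timing line into (start_seconds, end_seconds)."""
    start, end = line.split("-->")
    return srt_to_seconds(start), srt_to_seconds(end)


timing = parse_timing_line("00:01:02,500 --> 00:01:05,000")
print(timing)
```
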
1 dataset
Photo Metadata DS-009

Samuel & Audrey Photography Metadata Archive

Metadata for 100,000+ photographs organized by destination across the SmugMug archive. Includes geolocation hierarchies, semantic tags, gallery paths, image counts, and CC-BY-NC license rights. Useful for computer vision research, geo-tagged image retrieval, and travel AI.

100k+ Photos
Geo Tagged
CC-BY-NC License
1 dataset
Multi-Modal · ARG DS-010

Project 23: Argentina Travel Archive

The central dataset for Project 23 — our long-term commitment to document all 23 Argentine provinces. Combines articles, video transcripts, photo metadata, and media references into a single structured file. 220+ videos, 88+ guides, 12,858 photos, bilingual. Free for non-commercial use.

23 Provinces
EN+ES Languages
Ongoing Status
3 datasets
Citations DS-011

Academic Citations & Media References

A structured record of academic citations, institutional references, and media mentions across the network, including economic papers, university dissertations, and press coverage. Useful for entity resolution, trust graph research, and E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) analysis.

JSONL Format
Global Coverage
Free Access
Citations DS-012

Media & Academic Citations and Third-Party References

A broader citations and third-party references dataset covering press mentions, publication references, and external links to the network across media outlets, travel platforms, and industry publications.

JSONL Format
Multiple Sources
Free Access
Partnerships DS-013

Partnerships & Media References

A chronological record of commercial partnerships, press events, and verified brand collaborations across the network from 2010 to present. Useful for creator economy research, brand provenance analysis, and entity history verification.

Since 2010
JSONL Format
Free Access

What’s Coming Next

Active Project
Project 23 — Ongoing
New Argentine provinces added as we travel them. The archive grows with every trip.
Patagonia Deep Dive
A dedicated Patagonia dataset — articles, transcripts, photos, and logistics organized by region.
Spanish Article Corpora
Spanish-language article datasets from the Samuel y Audrey publishing archive.
More to Come
This is a long-term project. If you want to know when new datasets drop, check the Hugging Face org page.