DIY Repair Videos: A Multimodal YouTube Dataset for Instructional Content Analysis

Overview:
This dataset contains 6,015 YouTube DIY-repair tutorial videos, each enriched with structured metadata, transcripts, viewer comments, channel details, and a rigorous, multi-round manual annotation of instructional content.

Key Components:

Metadata & Engagement
  Fields: Video_ID, Title, Description, Duration (ISO 8601 + seconds), View_Count, Like_Count, Comment_Count, Published_At, Thumbnail_URL
  Metric: Engagement_Ratio = Like_Count / (View_Count + 1)

Transcripts
  Source: YouTube auto-captions (empty if unavailable)
  Fields: Transcript (raw text)
  Manual Rounds:
    TR_A1, TR_A2, TR_A3: three independent transcript reviews (correcting major errors, marking non-verbal segments)
    TR_Final: consolidated transcript after consensus

DIY Category Annotation
  Manual Rounds:
    DIY_A1, DIY_A2, DIY_A3: three independent category assignments using the annotation guide
    DIY_Final: consensus category after adjudication
  Coverage: 16 DIY sub-domains (e.g., "home repair," "plumbing," "woodworking," "other")
  Reliability: Inter-annotator agreement (Fleiss's κ = 0.76)

Comments
  Fields: Comments (JSON array of up to 50 top-level comments), Has_Comments (true if ≥ 20 total words)

Channel Context
  Fields: Channel_ID, Channel_Title, Channel_Thumbnail_URL

Annotation Methodology:

1. Stratified Subset Selection
   A subset of 180 videos was sampled to represent all DIY categories proportionally.
2. Annotation Guide
   A concise manual defined each DIY category and outlined transcription conventions.
3. Independent Annotations
   Three team members performed Rounds 1–3 (DIY_A1–3 and TR_A1–3) without access to one another's labels.
4. Consensus Adjudication
   For each video, a fourth pass produced DIY_Final and TR_Final: the agreed-upon labels and corrected transcript.

Repository Structure:

DIY-Repair-Youtube-Dataset/
│
├── data/
│   ├── video_metadata.csv     # Main dataset file (6,015 rows × 19 columns)
│   └── data_dictionary.csv    # Definitions of each column/field
│
├── CITATION.cff
├── LICENSE
├── README.md
└── requirements.txt

Note: The dataset was annotated manually, and the transcripts were manually reviewed.
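The derived fields described under Key Components can be recomputed from the raw columns. The following is a minimal Python sketch, not the authors' code. The function names are hypothetical, and it assumes that Duration uses the `PnDTnHnMnS` form of ISO 8601, that Comments is a JSON array of plain strings, and that "total words" means whitespace-separated tokens summed over all comments.

```python
import json
import re

# Matches the day/hour/minute/second subset of ISO 8601 durations,
# e.g. "PT1H2M3S". Assumption: the Duration column uses this form.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(iso_duration: str) -> int:
    """Convert an ISO 8601 duration string to a number of seconds."""
    match = _DURATION_RE.match(iso_duration)
    if match is None:
        raise ValueError(f"unsupported duration: {iso_duration!r}")
    parts = {name: int(value) if value else 0
             for name, value in match.groupdict().items()}
    return (parts["days"] * 86400 + parts["hours"] * 3600
            + parts["minutes"] * 60 + parts["seconds"])


def engagement_ratio(like_count: int, view_count: int) -> float:
    """Engagement_Ratio = Like_Count / (View_Count + 1).

    The +1 keeps the ratio defined for videos with zero views.
    """
    return like_count / (view_count + 1)


def has_comments(comments_json: str, min_words: int = 20) -> bool:
    """Return True if the comments contain at least `min_words` words.

    Assumption: each array element is a plain comment string, and words
    are counted by splitting on whitespace.
    """
    comments = json.loads(comments_json) if comments_json else []
    total_words = sum(len(text.split())
                      for text in comments if isinstance(text, str))
    return total_words >= min_words
```

For example, `duration_to_seconds("PT15M")` gives 900 and `engagement_ratio(50, 99)` gives 0.5. If the Comments array stores objects rather than strings, `has_comments` would need to read the text field first.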
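The reported inter-annotator agreement (Fleiss's κ) can in principle be checked from the three independent label columns. The sketch below implements the standard Fleiss's kappa formula in plain Python, as an illustration only. The helper `needs_adjudication` is hypothetical, and the toy labels are invented for the example, not taken from the dataset.

```python
from collections import Counter
from typing import Hashable, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Fleiss's kappa for items each rated by the same number of raters.

    `ratings` holds one sequence of labels per item, for example
    (DIY_A1, DIY_A2, DIY_A3) for each annotated video.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]
    categories = {label for item in ratings for label in item}

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters)
        / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_expected = sum(
        (sum(cnt[cat] for cnt in counts) / total) ** 2 for cat in categories
    )
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1 - p_expected)


def needs_adjudication(labels: Sequence[Hashable]) -> bool:
    """Hypothetical helper: True when the annotators do not all agree."""
    return len(set(labels)) > 1


# Toy labels for illustration only (not real dataset rows).
toy = [
    ("plumbing", "plumbing", "plumbing"),
    ("woodworking", "woodworking", "home repair"),
    ("home repair", "home repair", "home repair"),
]
kappa = fleiss_kappa(toy)
flagged = [needs_adjudication(row) for row in toy]
```

Rows flagged by `needs_adjudication` are the ones where a consensus pass would have had to resolve a disagreement. How the authors actually adjudicated is not specified beyond the fourth pass described above.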
