From b1f372e474472ed238f6b48ac955f2548da27b5e Mon Sep 17 00:00:00 2001 From: Charles Cabergs Date: Sun, 6 Dec 2020 02:53:58 +0100 Subject: Added boogy man pipeline in yt to rss blog post --- blog/youtube_to_rss.md | 48 +++++++++++++++++++++++------------------------- 1 file changed, 23 insertions(+), 25 deletions(-) diff --git a/blog/youtube_to_rss.md b/blog/youtube_to_rss.md index 8b414e1..c96e31f 100644 --- a/blog/youtube_to_rss.md +++ b/blog/youtube_to_rss.md @@ -10,45 +10,43 @@ Right click on something that is not a link or an image and select `Save as`, gi ## Parse list of subscription -Now let's get a tiny bit fancy with Python and BeautifulSoup. +Get all the channels urls, replace `channels.html` with the HTML file you saved. -Download Python from [here](https://www.python.org/downloads/) (If you're on Linux, you can install it from your package manager). Make sure you install Python3.\* and not Python2. - -Download BeautifulSoup with pip `pip3 install bs4` (sure?) - -```python -#!/usr/bin/env python3 - - -from bs4 import BeautifulSoup - -BeautifulSoup(content) - -... +``` +grep -o -E 'href="https://www.youtube.com/(c|channel|user)/[a-zA-Z0-9 ]+"' channels.html | + sort | + uniq | + sed 's/href="\(.*\)"/\1/' > channel_urls ``` -You can download this script [here](extract_channels.py), -it's a bit different from the one in this article, -read the code before running it if you want to make sure their isn't any shenenigans. +Some channels don't aren't prefixed with `/c/` `/channel/` or `/user/` in the url +so you'll either have to add them manually or change the `grep` regex to accept all url +which begin with `https://www.youtube.com/` and remove the links which aren't youtube channels. -``` -$ curl -O https://cacharle.xyz/blog/extract_channels.py -$ chmod +x extract_channels.py -$ ./extract_channels.py < channels.html -``` +> Those channels are pretty rare tho, on my 300+ subscriptions I only had 2 or 3. -## Choose the channel to add to your feeds +## Choose the channels to add to your feeds Now comes the tedious and cringing part where you need to through aaaall your old and obscure subscription and filter the bad ones out. > Protip: If you want to automate this part you can ask Google to do it for you since they know you better than yourself by now. -## Get channel feed + +## Get channel info I guess most RSS reader understand the HTML /balise/ `` but [newsboat]() (which I use) doesn't unfortunatly. We can get the url to a channel feed with a simple `curl` into `grep`. ``` -curl | grep -E ' +xargs -a channel_urls curl -s | + stdbuf -oL grep -o -E \ + -e '.* - YouTube' \ + -e 'https://www\.youtube\.com/feeds/videos\.xml\?channel_id=[a-zA-Z0-9_-]+' | + awk '!seen[$0]++' | + sed 's:\(.*\) - YouTube:\1:' | + sed 'N; s/\n/ /' | + sed 's/\(.*\) # \(https:.*\)/\2 # \1/' | + tee /dev/stderr 2> urls | + cat -n ``` ## Sources -- cgit