author    Charles Cabergs <me@cacharle.xyz>  2020-12-06 02:53:58 +0100
committer Charles Cabergs <me@cacharle.xyz>  2020-12-06 02:53:58 +0100
commit    b1f372e474472ed238f6b48ac955f2548da27b5e (patch)
tree      e4373c545fecae640081ce73aabe3fdc6c957ec6
parent    4b88c2da06514a9c69f8c4121f15549631e53675 (diff)
Added boogy man pipeline in yt to rss blog post
-rw-r--r--  blog/youtube_to_rss.md | 48
1 file changed, 23 insertions(+), 25 deletions(-)
diff --git a/blog/youtube_to_rss.md b/blog/youtube_to_rss.md
index 8b414e1..c96e31f 100644
--- a/blog/youtube_to_rss.md
+++ b/blog/youtube_to_rss.md
@@ -10,45 +10,43 @@ Right click on something that is not a link or an image and select `Save as`, gi
## Parse list of subscriptions
-Now let's get a tiny bit fancy with Python and BeautifulSoup.
+Get all the channel URLs; replace `channels.html` with the HTML file you saved.
-Download Python from [here](https://www.python.org/downloads/) (If you're on Linux, you can install it from your package manager). Make sure you install Python3.\* and not Python2.
-
-Download BeautifulSoup with pip `pip3 install bs4` (sure?)
-
-```python
-#!/usr/bin/env python3
-
-
-from bs4 import BeautifulSoup
-
-BeautifulSoup(content)
-
-...
+```
+grep -o -E 'href="https://www.youtube.com/(c|channel|user)/[a-zA-Z0-9 ]+"' channels.html |
+ sort |
+ uniq |
+ sed 's/href="\(.*\)"/\1/' > channel_urls
```
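To see what the pipeline above produces, here is a minimal sketch run on a tiny stand-in file (`sample.html` is a made-up substitute for the real `channels.html`):

```shell
# Stand-in for the saved subscriptions page, with one duplicate link.
cat > sample.html <<'EOF'
<a href="https://www.youtube.com/c/SomeChannel">x</a>
<a href="https://www.youtube.com/user/OtherUser">y</a>
<a href="https://www.youtube.com/c/SomeChannel">dup</a>
EOF

# Same pipeline as above: extract hrefs, dedupe, strip the href="..." wrapper.
grep -o -E 'href="https://www.youtube.com/(c|channel|user)/[a-zA-Z0-9 ]+"' sample.html |
    sort |
    uniq |
    sed 's/href="\(.*\)"/\1/'
# → https://www.youtube.com/c/SomeChannel
# → https://www.youtube.com/user/OtherUser
```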
-You can download this script [here](extract_channels.py),
-it's a bit different from the one in this article,
-read the code before running it if you want to make sure their isn't any shenenigans.
+Some channels aren't prefixed with `/c/`, `/channel/`, or `/user/` in the URL,
+so you'll either have to add them manually or change the `grep` regex to accept all URLs
+which begin with `https://www.youtube.com/` and then remove the links which aren't YouTube channels.
-```
-$ curl -O https://cacharle.xyz/blog/extract_channels.py
-$ chmod +x extract_channels.py
-$ ./extract_channels.py < channels.html
-```
+> Those channels are pretty rare though: out of my 300+ subscriptions, I only had 2 or 3.
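One way to handle those unprefixed channels is sketched below. The looser regex and the filter list are an assumption on my part, not the author's exact fix; `subs.html` stands in for your saved subscriptions page.

```shell
# Sketch (assumed approach, not the author's): accept any youtube.com link,
# then filter out obvious non-channel paths like /watch.
cat > subs.html <<'EOF'
<a href="https://www.youtube.com/SomeVanityName">x</a>
<a href="https://www.youtube.com/watch?v=abc123">y</a>
EOF

grep -o -E 'href="https://www.youtube.com/[^"]+"' subs.html |
    sed 's/href="\(.*\)"/\1/' |
    grep -v -E '/(watch|playlist|feed)'
```

You'll likely still need a manual pass over the result, since vanity URLs can't be distinguished from other top-level YouTube pages by pattern alone.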
-## Choose the channel to add to your feeds
+## Choose the channels to add to your feeds
Now comes the tedious and cringe-inducing part where you need to go through aaaall your old and obscure subscriptions and filter the bad ones out.
> Protip: If you want to automate this part you can ask Google to do it for you since they know you better than yourself by now.
-## Get channel feed
+
+## Get channel info
I guess most RSS readers understand the HTML tag `<link rel="alternate" type="application/rss" .../>` but [newsboat]() (which I use) unfortunately doesn't.
We can get the URL of a channel's feed with a simple `curl` piped into `grep`.
```
-curl <CHANNEL_URL> | grep -E '<link rel="alternate".*rss/>
+xargs -a channel_urls curl -s |
+ stdbuf -oL grep -o -E \
+ -e '<title>.* - YouTube</title>' \
+ -e 'https://www\.youtube\.com/feeds/videos\.xml\?channel_id=[a-zA-Z0-9_-]+' |
+ awk '!seen[$0]++' |
+ sed 's:<title>\(.*\) - YouTube</title>:\1:' |
+    sed 'N; s/\n/ # /' |
+ sed 's/\(.*\) # \(https:.*\)/\2 # \1/' |
+ tee /dev/stderr 2> urls |
+ cat -n
```
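The pairing and swapping steps in the middle of that pipeline are the least obvious part, so here is a sketch of just those `sed` stages on canned input (the channel name and id are made up, and no network access is needed):

```shell
# Simulate what grep would extract from one channel page: a <title> line
# followed by its feed URL, then strip the title tags, join the pair into
# one line with " # " as separator, and swap to "URL # name" order.
printf '%s\n' \
    '<title>Some Channel - YouTube</title>' \
    'https://www.youtube.com/feeds/videos.xml?channel_id=UCxxxxxxxxxxxxxxxxxxxxxx' |
    sed 's:<title>\(.*\) - YouTube</title>:\1:' |
    sed 'N; s/\n/ # /' |
    sed 's/\(.*\) # \(https:.*\)/\2 # \1/'
# → https://www.youtube.com/feeds/videos.xml?channel_id=UCxxxxxxxxxxxxxxxxxxxxxx # Some Channel
```

Note that the final swap only works because the join step inserts a literal ` # ` separator for the last `sed` to match on.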
## Sources