Extracting m3u8 Links from a Website Using Python: Scraping a Sitemap XML and Saving the Data to a CSV File

In this tutorial, we’ll show you how to use Python to extract data from a sitemap and save it to a CSV file. This can be a useful tool for SEO professionals or anyone looking to analyze the data on a website.

First, let’s make sure you have the necessary libraries installed. The csv and time modules ship with Python’s standard library, so you only need to install selenium, requests, beautifulsoup4, and lxml (BeautifulSoup relies on lxml to parse XML). You can do this by running the following command in your terminal:

pip install selenium requests beautifulsoup4 lxml

Next, you’ll need a ChromeDriver executable that matches your installed version of Chrome. You can download it from https://chromedriver.chromium.org/downloads. (If you’re using Selenium 4.6 or newer, Selenium Manager can fetch a matching driver automatically, so a manual download is often unnecessary.)
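If you do need to point Selenium at a specific driver binary, Selenium 4’s Service object is the standard way to do it. A minimal sketch (the driver path below is a placeholder for wherever you saved the executable):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Assumption: chromedriver.exe was downloaded to this (placeholder) location
service = Service('C:/tools/chromedriver.exe')
driver = webdriver.Chrome(service=service)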

Once you have the libraries and ChromeDriver installed, you can start by importing them into your script:

import csv
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

Next, create a webdriver object to control the Chrome browser. Called with no arguments, Selenium looks for ChromeDriver on your PATH (or, on Selenium 4.6+, downloads a matching driver automatically):

driver = webdriver.Chrome()

Now we can use the requests library to make a GET request to the sitemap URL. In this example, we’re using the sitemap for the website https://www.webcamromania.ro/:

response = requests.get('https://www.webcamromania.ro/post-sitemap.xml')
soup = BeautifulSoup(response.content, 'xml')
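If you want the script to fail fast when the sitemap can’t be fetched, you can add a status check right after the request; raise_for_status() is requests’ built-in error handling (an optional addition, not part of the original script):

response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses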

We can use BeautifulSoup to parse the XML content of the sitemap and extract all the loc tags. These tags contain the URLs of the pages on the website:

links = []
for loc in soup.find_all('loc'):
    link = loc.text
    if not any(link.endswith(ext) for ext in ['.jpg', '.png']):
        links.append(link)

In the above code, we’re storing the URLs in a list called links and filtering out any URLs that end with ‘.jpg’ or ‘.png’. This is because these URLs are likely to be image files rather than pages on the website.

Now we can create a CSV file to store the data that we extract from the pages. In this example, we’re creating a file called ‘data.csv’ on the desktop (replace USERNAME in the path with your own Windows user name):

with open('C:/Users/USERNAME/Desktop/data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['link', 'title', 'description', 'og_image', 'og_video', 'm3u8']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

We’re using the csv library to create the file and write the field names ‘link’, ‘title’, ‘description’, ‘og_image’, ‘og_video’, ‘m3u8’ as the header.

Now that we have a CSV file set up to store the data, we can iterate over the links list and extract the data from each page. Here’s the code to do that:

for link in links:
    driver.get(link)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')

    # Extract the title and description
    title = soup.find('title').text if soup.find('title') else ''
    description = soup.find('meta', {'property': 'og:description'})['content'] if soup.find('meta', {'property': 'og:description'}) else ''

    # Extract the og:image and og:video links (if present)
    og_image = soup.find('meta', {'property': 'og:image'})['content'] if soup.find('meta', {'property': 'og:image'}) else ''
    og_video = soup.find('meta', {'property': 'og:video'})['content'] if soup.find('meta', {'property': 'og:video'}) else ''

    # If og:video is present, extract the m3u8 link
    m3u8 = ''
    if og_video:
        driver.get(og_video)
        time.sleep(1)  # wait a second for the video player to load
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        video_element = soup.find('video')
        source_element = video_element.find('source') if video_element else None
        m3u8 = source_element.get('src') if source_element else 'm3u8 link not found'
        print(m3u8)

    # Append the row to the CSV file
    with open('C:/Users/USERNAME/Desktop/data.csv', 'a', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writerow({'link': link, 'title': title, 'description': description, 'og_image': og_image, 'og_video': og_video, 'm3u8': m3u8})

For each link in the links list, the script does the following:

  1. Navigates to the link in the Chrome browser using the webdriver.
  2. Extracts the HTML content of the page and uses BeautifulSoup to parse it.
  3. Extracts the page’s title and description from the HTML.
  4. Extracts the og:image and og:video links from the HTML (if present).
  5. If an og:video link is present, navigates to it and extracts the m3u8 link from the HTML (a more defensive variant of this step is sketched after this list).
  6. Appends the data (link, title, description, og_image, og_video, m3u8) to the ‘data.csv’ file.
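The loop above assumes every page loads cleanly and that every player exposes its stream as a <source> child of a <video> tag. In practice, pages time out and some players put the src attribute directly on the <video> element instead. Here is a more defensive sketch of the m3u8 step under those assumptions (a variation, not part of the original script):

m3u8 = ''
if og_video:
    try:
        driver.get(og_video)
        time.sleep(1)  # give the player a moment to render
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        video_element = soup.find('video')
        if video_element:
            # Prefer a <source> child, but fall back to src on the <video> tag itself
            source_element = video_element.find('source')
            if source_element and source_element.get('src'):
                m3u8 = source_element['src']
            elif video_element.get('src'):
                m3u8 = video_element['src']
    except Exception as e:
        print(f'Failed to load {og_video}: {e}')
    if not m3u8:
        m3u8 = 'm3u8 link not found'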

Finally, we can shut down the Chrome browser by calling the quit() method on the webdriver object (quit() ends the entire browser session, whereas close() only closes the current window):

driver.quit()

And that’s it! You now have a script that can extract data from a sitemap and save it to a CSV file. You can modify the script to fit your specific needs, such as changing the sitemap URL or the data that you want to extract.
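One easy way to adapt the script is to pull the values you are most likely to change into variables at the top (the names below are arbitrary, not from the original):

# Hypothetical configuration variables; rename or extend as needed
SITEMAP_URL = 'https://www.webcamromania.ro/post-sitemap.xml'
CSV_PATH = 'C:/Users/USERNAME/Desktop/data.csv'
SKIP_EXTENSIONS = ('.jpg', '.png')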

One thing to keep in mind is that web scraping can be resource-intensive and may take a long time to run, especially if the sitemap contains a large number of pages. You may want to consider adding a delay between requests to avoid overloading the server or running into rate limiting issues. You could also consider running the script on a server with more resources or using a cloud service to distribute the workload.
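A simple way to stay polite is to sleep between page loads. A minimal sketch (the one-second delay is an arbitrary starting point; tune it to the site’s tolerance):

for link in links:
    driver.get(link)
    # ... extract and write the data as shown above ...
    time.sleep(1)  # pause between requests to avoid overloading the server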


This script does the following:

  1. Imports the necessary libraries (csv, time, selenium, requests, and BeautifulSoup).
  2. Creates a webdriver object to control the Chrome browser.
  3. Makes a GET request to the URL ‘https://www.webcamromania.ro/post-sitemap.xml’ and uses BeautifulSoup to parse the XML content.
  4. Extracts all the loc tags from the XML content and stores their text values in a list called links, filtering out any links that end with ‘.jpg’ or ‘.png’.
  5. Creates a CSV file called ‘data.csv’ on the user’s desktop and writes the field names ‘link’, ‘title’, ‘description’, ‘og_image’, ‘og_video’, ‘m3u8’ as the header row.
  6. Iterates over the links list and, for each link:
     - navigates to the link in the Chrome browser using the webdriver;
     - extracts the HTML content of the page and parses it with BeautifulSoup;
     - extracts the page’s title and description from the HTML;
     - extracts the og:image and og:video links from the HTML (if present);
     - if an og:video link is present, navigates to it and extracts the m3u8 link from the HTML;
     - appends the data (link, title, description, og_image, og_video, m3u8) to the ‘data.csv’ file.
  7. Closes the Chrome browser.

Full Script:

import csv
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

# Create a webdriver object to control the Chrome browser
# (pass a Service with the ChromeDriver path if it isn't on your PATH)
driver = webdriver.Chrome()

# Read the sitemap XML file
response = requests.get('https://www.webcamromania.ro/post-sitemap.xml')
soup = BeautifulSoup(response.content, 'xml')

# Extract the links from the sitemap, skipping image files
links = []
for loc in soup.find_all('loc'):
    link = loc.text
    if not any(link.endswith(ext) for ext in ['.jpg', '.png']):
        links.append(link)

# Create a CSV file to store the data (replace USERNAME with your own user name)
with open('C:/Users/USERNAME/Desktop/data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['link', 'title', 'description', 'og_image', 'og_video', 'm3u8']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

# Iterate over the links and extract the data
for link in links:
    driver.get(link)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')

    # Extract the title and description
    title = soup.find('title').text if soup.find('title') else ''
    description = soup.find('meta', {'property': 'og:description'})['content'] if soup.find('meta', {'property': 'og:description'}) else ''

    # Extract the og:image and og:video links (if present)
    og_image = soup.find('meta', {'property': 'og:image'})['content'] if soup.find('meta', {'property': 'og:image'}) else ''
    og_video = soup.find('meta', {'property': 'og:video'})['content'] if soup.find('meta', {'property': 'og:video'}) else ''

    # If og:video is present, extract the m3u8 link
    m3u8 = ''
    if og_video:
        driver.get(og_video)
        time.sleep(1)  # wait a second for the video player to load
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        video_element = soup.find('video')
        source_element = video_element.find('source') if video_element else None
        m3u8 = source_element.get('src') if source_element else 'm3u8 link not found'
        print(m3u8)

    # Append the row to the CSV file
    with open('C:/Users/USERNAME/Desktop/data.csv', 'a', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writerow({'link': link, 'title': title, 'description': description, 'og_image': og_image, 'og_video': og_video, 'm3u8': m3u8})

# Close the webdriver
driver.quit()

Here’s what the resulting CSV should look like.
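An illustrative example with placeholder values (not real scraped data):

link,title,description,og_image,og_video,m3u8
https://www.webcamromania.ro/example-page/,Example Page Title,An example description,https://example.com/image.jpg,https://example.com/player,https://example.com/stream.m3u8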

