This project implements a Python web scraper to extract job postings from jobs.dou.ua. It can scrape data from both the main website search results and the RSS feed, providing flexibility and comprehensive data collection.
- Website Scraping: Extracts job listings from the main search results pages on jobs.dou.ua, including:
- Job ID
- Job Title (Role)
- Company Name
- Company DOU.ua Page URL
- City/Location
- Job Detail Page URL
- Salary (if available)
- Detailed Job Description (with preserved formatting: paragraphs, lists)
- Extracted Skills (using keyword matching; see the sketch after this list)
- Date Scraped
- RSS Feed Scraping: Fetches job listings from the jobs.dou.ua RSS feed, providing an alternative and often faster way to get data. Includes the same detailed information as website scraping by fetching each job's detail page.
- Filtering: Supports filtering by job category, remote work (boolean), and experience level (string).
- CSV Output: Saves the scraped data to a CSV file, with separate files for website and RSS feed scraping.
- Error Handling: Includes robust error handling for network requests, HTML/XML parsing, and data extraction.
- Modular Design: Uses separate functions for URL generation, data extraction, and detail page fetching, making the code well-organized and maintainable.
- Respectful Scraping: Includes delays between requests to avoid overloading the target website.
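The skill extraction mentioned in the feature list is a simple keyword scan over each job description. A minimal sketch, with an illustrative keyword list (the script's actual list may differ):

```python
# Minimal sketch of keyword-based skill extraction; the keyword list here
# is illustrative, not the script's actual list.
SKILL_KEYWORDS = ["Python", "Django", "Flask", "SQL", "Docker", "AWS", "Git"]

def extract_skills(description):
    """Return the keywords that appear in the description (case-insensitive)."""
    text = description.lower()
    return [kw for kw in SKILL_KEYWORDS if kw.lower() in text]

print(extract_skills("We need a Python dev with Docker and AWS experience."))
# ['Python', 'Docker', 'AWS']
```

A production version might match on word boundaries instead of substrings to avoid false positives (e.g., "Java" matching inside "JavaScript").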
- Python 3.6+
- `requests` library (`pip install requests`)
- `beautifulsoup4` library (`pip install beautifulsoup4`)
The script provides two main functions:
- `main(category, remote=False, experience=None)`: Scrapes job data from the website's search results.
- `rss_main(category, remote=False, experience=None)`: Scrapes job data from the RSS feed.
Both functions take the following arguments:
- `category` (str): The job category to search for (e.g., "Python", "Data Science", "Analyst"). See below for a full list of valid categories.
- `remote` (bool, optional): If `True`, filters for remote jobs. Defaults to `False`.
- `experience` (str, optional): Filters by experience level. Valid options are: "0-1", "1-3", "3-5", "5plus". If not specified, all experience levels are included.
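For illustration, these arguments could map onto the site's query string roughly as follows. This is a hypothetical sketch: the parameter names (`category`, `remote`, `exp`) are assumptions about jobs.dou.ua's URL scheme, not a documented API.

```python
# Hypothetical sketch of how the arguments could map to jobs.dou.ua query
# parameters. The parameter names ("category", "remote", "exp") are
# assumptions about the site's URL scheme, not a documented API.
from urllib.parse import urlencode

BASE_URL = "https://jobs.dou.ua/vacancies/"

def build_search_url(category, remote=False, experience=None):
    params = {"category": category}
    if remote:
        params["remote"] = ""  # presence of the flag is assumed to filter remote jobs
    if experience:
        params["exp"] = experience  # e.g. "1-3"
    return BASE_URL + "?" + urlencode(params)

print(build_search_url("Python", remote=True, experience="1-3"))
# https://jobs.dou.ua/vacancies/?category=Python&remote=&exp=1-3
```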
Example Usage (in a Python script or Jupyter Notebook):
```python
# from your_script_name import main, rss_main  # If running from a separate script

# Scrape remote Python jobs with 1-3 years of experience from the website:
main("Python", remote=True, experience="1-3")

# Scrape Data Science jobs (all locations and experience levels) from the RSS feed:
rss_main("Data Science")

# Scrape Analyst jobs (all locations and experience levels) from the website:
main("Analyst", remote=False)  # No need to specify experience if you want all.
```

Valid Job Categories:
".NET", "Account Manager", "AI/ML", "Analyst", "Android", "Animator", "Architect", "Artist", "Assistant", "Big Data", "Blockchain", "C++", "C-level", "Copywriter", "Data Engineer", "Data Science", "DBA", "Design", "DevOps", "Embedded", "Engineering Manager", "Erlang", "ERP/CRM", "Finance", "Flutter", "Front End", "Golang", "Hardware", "HR", "iOS/macOS", "Java", "Legal", "Marketing", "Node.js", "Office Manager", "Other", "PHP", "Product Manager", "Project Manager", "Python", "QA", "React Native", "Ruby", "Rust", "Sales", "Salesforce", "SAP", "Scala", "Scrum Master", "Security", "SEO", "Support", "SysAdmin", "Technical Writer", "Unity", "Unreal Engine", "Військова справа"
Output:
The script creates CSV files in the same directory where it's run.
- The `main` function creates files named: `{category}_jobs_{YYYYMMDD_HHMMSS}.csv`
- The `rss_main` function creates files named: `rss_{category}_jobs_{YYYYMMDD_HHMMSS}.csv`
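The timestamp portion comes from `datetime.strftime`; a minimal sketch of the naming pattern (the helper name `output_filename` is illustrative):

```python
# Sketch of the timestamped naming pattern described above; the helper name
# output_filename is illustrative.
from datetime import datetime

def output_filename(category, rss=False):
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    prefix = "rss_" if rss else ""
    return "{}{}_jobs_{}.csv".format(prefix, category, stamp)

print(output_filename("Python"))            # e.g. Python_jobs_20240101_120000.csv
print(output_filename("Python", rss=True))  # e.g. rss_Python_jobs_20240101_120000.csv
```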
CSV File Structure:
The CSV files have the following columns:
- `id`: The unique job ID from the DOU.ua URL.
- `title`: The job title (role).
- `company`: The company name.
- `company_dou_url`: The URL of the company's page on DOU.ua.
- `city`: The job location (city or "Remote").
- `url`: The URL of the job detail page.
- `salary`: The salary (if available).
- `detailed_description`: The full job description, with paragraphs and lists preserved.
- `skills`: A list of extracted skills (keywords).
- `date_scraped`: The date and time when the data was scraped.
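A minimal sketch of how rows with these columns can be written using the standard library's `csv.DictWriter` (the helper and variable names are illustrative):

```python
# Sketch of writing rows with the columns above via csv.DictWriter.
# `jobs` is assumed to be a list of dicts keyed by these column names.
import csv

FIELDNAMES = [
    "id", "title", "company", "company_dou_url", "city",
    "url", "salary", "detailed_description", "skills", "date_scraped",
]

def save_to_csv(jobs, filename):
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(jobs)
```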
- Website Analysis: The HTML structure of jobs.dou.ua was analyzed using browser developer tools to identify the relevant tags and attributes for data extraction. The RSS feed structure was also examined.
- Core Scraping Implementation: `requests` is used to fetch HTML content, and `BeautifulSoup` is used for parsing (see the sketch after this list). Functions were created to extract data from both the main search results page and individual job detail pages.
- Detailed Information Extraction: The script fetches each job's detail page to extract the full description, skills, company information, and role.
- RSS Feed Integration: The `xml.etree.ElementTree` module is used to parse the XML data from the RSS feed. A separate function (`rss_main`) handles RSS feed scraping, providing an alternative data source.
- Skill Extraction: A keyword-based approach is used to extract skills from the detailed job descriptions.
- Error Handling: `try...except` blocks are used throughout the code to handle potential network errors, parsing errors, and missing data.
- Code Modularity: The code is refactored and split into functions for better readability.
- Respectful Scraping: Added delays between requests to avoid overloading the target website.
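As a rough illustration of the core scraping step described above, here is a minimal fetch-and-parse sketch with `requests` and `BeautifulSoup`. The CSS selectors (`li.l-vacancy`, `a.vt`) and the `User-Agent` header are assumptions about the page markup, and the `try...except` error handling from the list above is omitted for brevity:

```python
# Minimal sketch of the fetch-and-parse loop with a polite delay. The CSS
# selectors ("li.l-vacancy", "a.vt") are assumptions about the page markup;
# the try...except error handling described above is omitted for brevity.
import time
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (job-scraper demo)"}

def scrape_listing_page(url, delay=1.0):
    """Fetch one search-results page and extract basic fields per vacancy."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    time.sleep(delay)  # be polite: pause after each request
    soup = BeautifulSoup(response.text, "html.parser")
    jobs = []
    for vacancy in soup.select("li.l-vacancy"):  # assumed listing container
        title_link = vacancy.select_one("a.vt")  # assumed title selector
        if title_link:
            jobs.append({
                "title": title_link.get_text(strip=True),
                "url": title_link.get("href"),
            })
    return jobs
```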
This solution achieves the project objectives by providing a robust and flexible way to scrape job data from jobs.dou.ua. It can retrieve data from both the main website and the RSS feed, offering alternative data sources. The script extracts comprehensive information, including detailed descriptions and skills. The use of functions and error handling makes the code maintainable and reliable. The CSV output provides a structured format for further analysis and use.
Developed a Python web scraper using Requests, BeautifulSoup, and XML parsing to extract comprehensive job data, including detailed descriptions and skills, from jobs.dou.ua, saving the results to CSV files. The solution supports both website and RSS feed scraping, providing flexibility and robust data capture.
This project is licensed under the MIT License - see the LICENSE file for details.