Year: 2024
Domain: Backend
Access: Open Source
Complexity: 0 / 10
Python · Web Scraping · BeautifulSoup · E-commerce
Backend · Archived

ProductExtractor

An automated product data scraping pipeline that extracts product info and images from e-commerce websites via sitemap or Excel link lists, exporting to structured Excel files.

# ProductExtractor

An automated data pipeline that transforms a list of website links into a clean, organized product database. Designed for e-commerce product catalog extraction with image downloading and Excel export.

Pipeline Stages

Stage 1: Input Reading

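Accepts two types of source files in the input/ folder:

  • `.txt` files: A single URL pointing to a sitemap.xml file
  • `.xlsx` files: Any cells containing web links

python
import os
from typing import Dict, List

import pandas as pd

def read_input_files(input_dir: str) -> Dict[str, List[str]]:
    """Map each input file name to the product URLs found in it."""
    urls_by_source = {}

    for file in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, file)

        if file.endswith('.txt'):
            # The file holds a single sitemap URL
            with open(path, encoding='utf-8') as f:
                sitemap_url = f.read().strip()
            urls_by_source[file] = extract_sitemap(sitemap_url)

        elif file.endswith('.xlsx'):
            # Extract all links from Excel (assumes the helper returns a list of URLs)
            df = pd.read_excel(path)
            urls_by_source[file] = extract_excel_links(df)

    return urls_by_source
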
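
The "any cells containing web links" rule for Excel input can be illustrated with a small, dependency-free sketch. `find_links` and `URL_PATTERN` are hypothetical names introduced here, not the project's `extract_excel_links`, and the pattern is an assumption about what counts as a link.

```python
import re
from typing import Iterable, List

# Assumption: any whitespace-free token starting with http:// or https:// is a link.
URL_PATTERN = re.compile(r"https?://\S+")


def find_links(cells: Iterable[object]) -> List[str]:
    """Return the unique links found in the given cell values, in first-seen order."""
    seen = set()
    links = []
    for cell in cells:
        if not isinstance(cell, str):
            continue  # skip numbers, dates, and empty cells
        for match in URL_PATTERN.findall(cell):
            url = match.rstrip(".,;)")  # drop trailing punctuation picked up by \S+
            if url not in seen:
                seen.add(url)
                links.append(url)
    return links


cells = [
    "Widget",
    "https://example.com/p/1",
    19.99,
    "see https://example.com/p/2.",
    "https://example.com/p/1",
]
print(find_links(cells))
```

With pandas, the cell values of a sheet could be passed in as `df.to_numpy().ravel()`. Note that `pd.read_excel` treats the first row as a header by default, so links in that row would only be seen with `header=None`.
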
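Stage 2: Sitemap Extraction

Parses the sitemap XML and collects every URL listed in a `<loc>` element:

python
import xml.etree.ElementTree as ET
from typing import List

import requests

def extract_sitemap(sitemap_url: str) -> List[str]:
    response = requests.get(sitemap_url, timeout=30)
    response.raise_for_status()  # fail early instead of parsing an error page
    tree = ET.fromstring(response.content)

    # '{*}' matches <loc> in any XML namespace (Python 3.8+)
    return [elem.text.strip() for elem in tree.findall('.//{*}loc') if elem.text]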
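
To see what the `{*}loc` lookup matches without making a network request, here is a minimal sketch that parses an inline sitemap. `SAMPLE_SITEMAP` and `parse_sitemap_xml` are illustrative names, and the sample URLs are made up.

```python
import xml.etree.ElementTree as ET
from typing import List

# Made-up sitemap in the standard sitemaps.org namespace.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/product-1</loc></url>
  <url><loc> https://example.com/product-2 </loc></url>
</urlset>"""


def parse_sitemap_xml(xml_text: str) -> List[str]:
    """Return the text of every <loc> element, whatever its namespace."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    # '{*}loc' is the namespace wildcard form supported since Python 3.8.
    return [elem.text.strip() for elem in root.findall(".//{*}loc") if elem.text]


print(parse_sitemap_xml(SAMPLE_SITEMAP))
```

Without the wildcard, `findall('.//loc')` would return nothing for this document, because every element lives in the sitemap namespace.
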

Stage 3: Web Page Downloading

Downloads HTML for each product page:

python
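import os
import time

import requests

def download_html(url: str, output_dir: str):
    response = requests.get(url, timeout=30, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    })
    response.raise_for_status()  # do not save error pages as product pages

    # sanitize_filename: helper that turns the URL into a safe file name
    filename = sanitize_filename(url) + '.html'
    pages_dir = os.path.join(output_dir, 'Webpages')
    os.makedirs(pages_dir, exist_ok=True)  # create the folder on first use
    filepath = os.path.join(pages_dir, filename)

    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(response.text)

    time.sleep(1)  # Rate limiting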

This creates an organized folder structure:

code
output/
└── source_name_output/
    └── Webpages/
        ├── product-page-1.html
        └── product-page-2.html
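
The file names in this tree come from `sanitize_filename`, whose body is not shown in this document. The following is one possible sketch, assuming the goal is to keep the last path segment of a URL (or the product name) and replace unsafe characters. The `max_length` parameter and the `"unnamed"` fallback are assumptions.

```python
import re
from urllib.parse import urlparse


def sanitize_filename(value: str, max_length: int = 100) -> str:
    """Turn a URL or product name into a filesystem-safe file stem."""
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https"):
        # For URLs, use the last non-empty path segment, falling back to the host.
        segments = [s for s in parsed.path.split("/") if s]
        value = segments[-1] if segments else parsed.netloc
    # Collapse every run of characters outside [A-Za-z0-9._-] into one hyphen.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "-", value).strip("-.")
    return cleaned[:max_length] or "unnamed"


print(sanitize_filename("https://example.com/products/product-page-1"))
print(sanitize_filename("Steel Bolt 10mm / M6"))
```

One trade-off of this approach: two URLs that end in the same segment map to the same file name, so a real implementation might add a hash or counter to avoid overwriting pages.
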

Stage 4: Data & Image Scraping

The core extraction step parses each saved HTML file and pulls out the product fields:

python
from bs4 import BeautifulSoup

def scrape_product_page(html_file: str) -> dict:
    with open(html_file, 'r', encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    # extract_css and extract_table are the project's selector helpers
    product = {
        'category': extract_css(soup, '.product-category'),
        'name': extract_css(soup, '.product-title'),
        'price': extract_css(soup, '.price-inr'),
        'image_url': extract_css(soup, '.product-image img', attr='src'),
        'description': extract_css(soup, '.product-description ul'),
        'specifications': extract_table(soup, '.specifications-table')
    }

    return product

| Data Field | CSS Selector Example |
| --- | --- |
| Category | `.product-category` |
| Name | `.product-title` |
| Price (₹) | `.price-inr` |
| Primary Image | `.product-image img` |
| Description | `.product-description ul` |
| Specifications | `.specifications-table` |
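
The helpers `extract_css` and `extract_table` are called above but not defined in this document. A minimal sketch of how they might look with BeautifulSoup follows; the return types (text or `None`, and a dict of label/value pairs) are assumptions, and the sample HTML is made up. The sketch uses the built-in `html.parser` so it runs without lxml.

```python
from typing import Dict, Optional

from bs4 import BeautifulSoup


def extract_css(soup: BeautifulSoup, selector: str,
                attr: Optional[str] = None) -> Optional[str]:
    """Return the text (or one attribute) of the first match, or None."""
    element = soup.select_one(selector)
    if element is None:
        return None  # selector not present on this page
    if attr is not None:
        value = element.get(attr)
        return value if isinstance(value, str) else None
    return element.get_text(" ", strip=True)


def extract_table(soup: BeautifulSoup, selector: str) -> Dict[str, str]:
    """Read a two-column table into a {label: value} dict."""
    specs: Dict[str, str] = {}
    table = soup.select_one(selector)
    if table is None:
        return specs
    for row in table.find_all("tr"):
        cells = row.find_all(["th", "td"])
        if len(cells) >= 2:
            specs[cells[0].get_text(strip=True)] = cells[1].get_text(" ", strip=True)
    return specs


SAMPLE_HTML = """
<h1 class="product-title"> Steel Bolt </h1>
<div class="product-image"><img src="/img/bolt.jpg"></div>
<table class="specifications-table">
  <tr><th>Material</th><td>Steel</td></tr>
  <tr><td>Size</td><td>M6</td></tr>
</table>
"""
sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(extract_css(sample_soup, ".product-title"))
print(extract_css(sample_soup, ".product-image img", attr="src"))
print(extract_table(sample_soup, ".specifications-table"))
```

Returning `None` for a missing element lets a page with a missing price or image still produce a row, rather than aborting the whole scrape.
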

Stage 5: Image Downloading & Reporting

python
import os
from typing import List

import requests

def download_product_images(products: List[dict], output_dir: str):
    image_dir = os.path.join(output_dir, 'Product_Images')
    os.makedirs(image_dir, exist_ok=True)

    for product in products:
        # Skip products whose image URL or name could not be scraped
        if product.get('image_url') and product.get('name'):
            filename = sanitize_filename(product['name']) + '.jpg'
            filepath = os.path.join(image_dir, filename)

            # Download image
            response = requests.get(product['image_url'], timeout=30)
            response.raise_for_status()
            with open(filepath, 'wb') as f:
                f.write(response.content)

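
The image downloader saves every file with a `.jpg` suffix. If the source images are not all JPEGs, the suffix could instead be derived from the image URL. The sketch below is an illustration only: `image_extension` and `ALLOWED_EXTENSIONS` are hypothetical names, not part of the project.

```python
import os
from urllib.parse import urlparse

# Assumption: these are the image types worth preserving; anything else uses the default.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def image_extension(image_url: str, default: str = ".jpg") -> str:
    """Pick a file extension from the URL path, ignoring the query string."""
    path = urlparse(image_url).path
    ext = os.path.splitext(path)[1].lower()
    return ext if ext in ALLOWED_EXTENSIONS else default


print(image_extension("https://example.com/img/bolt.PNG?size=large"))
print(image_extension("https://example.com/image?id=5"))
```
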
python
def generate_excel_report(products: List[dict], output_file: str):
    # Remove duplicates based on product name
    df = pd.DataFrame(products)
    df = df.drop_duplicates(subset=['name'])
    
    # Export to Excel
    df.to_excel(output_file, index=False, engine='openpyxl')

Project Structure

code
ProductExtractor/
├── input/                         # Source files go here
│   ├── sitemap_list.txt           # Contains sitemap URL
│   └── links.xlsx                 # Excel with product links
│
├── output/                        # Generated automatically
│   └── source_output/
│       ├── Webpages/              # Downloaded HTML files
│       ├── Product_Images/        # Downloaded images
│       └── products.xlsx          # Final Excel report
│
├── scraper.py                     # Main script
└── Instruction.md                 # User documentation

Usage

Step 1: Prepare Input

Option A: Sitemap

code
# file: my_sitemap.txt
https://example.com/sitemap.xml

Option B: Excel

Place any .xlsx file in the input/ folder; links are read from any cell that contains one.

Step 2: Run

bash
python scraper.py

Step 3: Results

code
output/
├── anshulimpex_output/
│   ├── Webpages/
│   │   ├── product-1.html
│   │   └── product-2.html
│   ├── Product_Images/
│   │   ├── product-1.jpg
│   │   └── product-2.jpg
│   └── anshulimpex_products.xlsx

Key Features

  1. Multi-source input: Accepts sitemap URLs or Excel link lists
  2. Automatic deduplication: Removes duplicate products by name
  3. Image downloading: Saves each product image under a file name derived from the product name
  4. Rate limiting: 1-second delay between requests to be polite
  5. Error handling: Continues on individual URL failures
  6. Excel export: Clean formatted spreadsheet output
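
The rate-limiting and error-handling features listed above can be sketched as a single loop. This is an illustration under assumptions, not the code in scraper.py: `process_urls` is a hypothetical name, and the `fetch` callable is injected so the sketch runs without network access.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple


def process_urls(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 1.0,
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Fetch each URL in turn, recording failures instead of stopping."""
    results: Dict[str, str] = {}
    failures: List[Tuple[str, str]] = []
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:  # broad on purpose: one bad URL must not end the run
            failures.append((url, str(exc)))
        time.sleep(delay_seconds)  # pause between requests
    return results, failures


def demo_fetch(url: str) -> str:
    """Stand-in for a real download; fails for URLs containing 'bad'."""
    if "bad" in url:
        raise ValueError("boom")
    return "<html>" + url + "</html>"


pages, errors = process_urls(
    ["https://a.example/1", "https://a.example/bad", "https://a.example/2"],
    demo_fetch,
    delay_seconds=0,  # no delay needed for the demo
)
print(len(pages), "pages,", len(errors), "failures")
```

Collecting failures in a list, rather than only logging them, makes it possible to retry or report the skipped URLs at the end of a run.
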

Dependencies

bash
pip install pandas openpyxl beautifulsoup4 lxml requests

Limitations

  • Website-specific: CSS selectors are tuned to one specific site's structure
  • Anti-scraping: Requests may be blocked by sites with strong protections
  • HTML changes: Selectors must be updated whenever a site is redesigned
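
One way to soften the selector-related limitations above is to keep selectors in data rather than in code, so adapting to a new or redesigned site means editing a mapping. The sketch below is a suggestion, not the project's current design. `DEFAULT_SELECTORS` reuses the selectors shown in Stage 4, while `SITE_SELECTORS`, its example host, and `selectors_for` are hypothetical.

```python
from typing import Dict
from urllib.parse import urlparse

# Selectors taken from the Stage 4 example.
DEFAULT_SELECTORS: Dict[str, str] = {
    "category": ".product-category",
    "name": ".product-title",
    "price": ".price-inr",
    "image_url": ".product-image img",
    "description": ".product-description ul",
    "specifications": ".specifications-table",
}

# Hypothetical per-site overrides, keyed by host name.
SITE_SELECTORS: Dict[str, Dict[str, str]] = {
    "shop.example.com": {"name": "h1.title", "price": ".amount"},
}


def selectors_for(url: str) -> Dict[str, str]:
    """Return the defaults with any overrides for the URL's host applied."""
    host = urlparse(url).hostname or ""
    return {**DEFAULT_SELECTORS, **SITE_SELECTORS.get(host, {})}


print(selectors_for("https://shop.example.com/p/1")["name"])
```
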

Use Cases

  1. Competitor analysis: Extract product catalogs for comparison
  2. Price monitoring: Regular extraction for price tracking
  3. Inventory management: Sync product data to internal systems
  4. Data backup: Create offline copy of online product database
