# ProductExtractor

An automated data pipeline that transforms a list of website links into a clean, organized product database. Designed for e-commerce product catalog extraction with image downloading and Excel export.

Architecture

Parsing system architecture diagram...

Pipeline Stages

Stage 1: Input Reading

Accepts two types of source files in the input/ folder:

```python
import os

import pandas as pd


def read_input_files(input_dir: str):
    """Send each file in input_dir to the matching link extractor."""
    for file in os.listdir(input_dir):
        path = os.path.join(input_dir, file)

        if file.endswith('.txt'):
            # Read sitemap URL
            with open(path) as f:
                sitemap_url = f.read().strip()
            extract_sitemap(sitemap_url)

        elif file.endswith('.xlsx'):
            # Extract all links from Excel
            df = pd.read_excel(path)
            extract_excel_links(df)
```

.txt files: Single URL to a sitemap.xml file
.xlsx files: Any cells containing web links
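The `extract_excel_links` helper is called above but not shown. As a rough sketch of what "any cells containing web links" could mean in practice, the hypothetical `find_links` function below scans cell values for `http://` or `https://` URLs. The function name and the URL pattern are assumptions made for illustration, not the project's actual code.

```python
import re
from typing import Any, Iterable, List

# Assumed pattern: anything that starts with http:// or https:// counts as a link.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def find_links(cells: Iterable[Any]) -> List[str]:
    """Return the unique links found in the cell values, in first-seen order."""
    seen = set()
    links = []
    for cell in cells:
        if not isinstance(cell, str):
            continue  # skip numbers, dates and empty cells
        for url in URL_PATTERN.findall(cell):
            if url not in seen:
                seen.add(url)
                links.append(url)
    return links


# With a DataFrame from pd.read_excel, every cell can be passed in one go:
#     links = find_links(df.to_numpy().ravel())
```

Scanning every cell instead of one named column matches the "any cells" wording, at the cost of also picking up links that are not product pages.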

Stage 2: Link Extraction

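```python
import xml.etree.ElementTree as ET
from typing import List

import requests


def extract_sitemap(sitemap_url: str) -> List[str]:
    response = requests.get(sitemap_url, timeout=30)
    response.raise_for_status()  # fail early if the sitemap cannot be fetched
    tree = ET.fromstring(response.content)

    # '{*}' matches <loc> in any XML namespace (Python 3.8+)
    urls = []
    for elem in tree.findall('.//{*}loc'):
        if elem.text:
            urls.append(elem.text.strip())

    return urls
```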
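Sitemap files declare an XML namespace, which is why the lookup uses `{*}loc` and not plain `loc`. The offline sketch below applies the same lookup to an inline sample document so the behaviour can be checked without a network request. The sample URLs and the `parse_sitemap_xml` name are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

# Made-up sample in the standard sitemap namespace.
SAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/widget-a</loc></url>
  <url><loc> https://example.com/products/widget-b </loc></url>
</urlset>"""


def parse_sitemap_xml(xml_bytes: bytes) -> List[str]:
    """Collect the text of every <loc> element, whatever its namespace."""
    root = ET.fromstring(xml_bytes)
    # The '{*}' wildcard needs Python 3.8 or newer.
    return [loc.text.strip() for loc in root.findall(".//{*}loc") if loc.text]


urls = parse_sitemap_xml(SAMPLE_SITEMAP)
```

A plain `.//loc` search would return nothing for this document, because each element's real tag is `{http://www.sitemaps.org/schemas/sitemap/0.9}loc`.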

Stage 3: Web Page Downloading

Downloads HTML for each product page:

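```python
import os
import time

import requests


def download_html(url: str, output_dir: str):
    response = requests.get(url, timeout=30, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    })
    # Raise on HTTP errors so error pages are not saved as product HTML;
    # callers can catch requests.RequestException to skip the URL
    response.raise_for_status()

    # Make sure the Webpages/ folder exists before writing into it
    pages_dir = os.path.join(output_dir, 'Webpages')
    os.makedirs(pages_dir, exist_ok=True)

    # sanitize_filename is a project helper (not shown here)
    filepath = os.path.join(pages_dir, sanitize_filename(url) + '.html')

    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(response.text)

    time.sleep(1)  # Rate limiting
```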

Creates organized folder structure:

```
output/
└── source_name_output/
    └── Webpages/
        ├── product-page-1.html
        └── product-page-2.html
```
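Stages 3 and 5 both call `sanitize_filename`, which this page does not show. One possible version is sketched below. It assumes that URLs should be reduced to their last path segment (which would produce names like `product-page-1`) and that any other text only needs unsafe characters replaced. Both rules are guesses, not the project's confirmed behaviour.

```python
import re
from urllib.parse import urlparse


def sanitize_filename(text: str, max_length: int = 100) -> str:
    """Turn a URL or product name into a conservative, filesystem-safe name."""
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https"):
        # For URLs, use the last path segment, falling back to the host name.
        text = parsed.path.rstrip("/").split("/")[-1] or parsed.netloc
    # Replace each run of characters outside a small safe set with one hyphen.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "-", text).strip("-.")
    return cleaned[:max_length] or "unnamed"
```

Under these rules, two URLs that end in the same segment map to the same file name, and query strings are dropped, so a real implementation may need a suffix or hash to keep names unique.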

Stage 4: Data & Image Scraping

The core extraction step parses each saved HTML file and pulls out the product fields:

```python
from bs4 import BeautifulSoup


def scrape_product_page(html_file: str) -> dict:
    with open(html_file, 'r', encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    # extract_css and extract_table are project helpers (not shown here)
    product = {
        'category': extract_css(soup, '.product-category'),
        'name': extract_css(soup, '.product-title'),
        'price': extract_css(soup, '.price-inr'),
        'image_url': extract_css(soup, '.product-image img', attr='src'),
        'description': extract_css(soup, '.product-description ul'),
        'specifications': extract_table(soup, '.specifications-table')
    }

    return product
```

| Data Field | CSS Selector Example |
| --- | --- |
| Category | `.product-category` |
| Name | `.product-title` |
| Price (₹) | `.price-inr` |
| Primary Image | `.product-image img` |
| Description | `.product-description ul` |
| Specifications | `.specifications-table` |
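The `extract_css` and `extract_table` helpers used in this stage are not shown. A minimal sketch of how they might work with BeautifulSoup follows. The return types (text or `None` for a single field, a label-to-value dict for the table) and the two-column table layout are assumptions, and the sample HTML is invented. The demo uses the built-in `html.parser` so it runs without `lxml`.

```python
from typing import Dict, Optional

from bs4 import BeautifulSoup


def extract_css(soup: BeautifulSoup, selector: str, attr: Optional[str] = None) -> Optional[str]:
    """Return the text (or one attribute) of the first match, or None."""
    element = soup.select_one(selector)
    if element is None:
        return None
    if attr is not None:
        return element.get(attr)
    return element.get_text(" ", strip=True)


def extract_table(soup: BeautifulSoup, selector: str) -> Dict[str, str]:
    """Read a two-column table into a {label: value} dict."""
    specs: Dict[str, str] = {}
    table = soup.select_one(selector)
    if table is None:
        return specs
    for row in table.find_all("tr"):
        cells = row.find_all(["th", "td"])
        if len(cells) >= 2:
            specs[cells[0].get_text(strip=True)] = cells[1].get_text(strip=True)
    return specs


# Invented sample page fragment for the demo.
html = """
<h1 class="product-title"> Steel Bolt </h1>
<div class="product-image"><img src="/img/bolt.jpg"></div>
<table class="specifications-table">
  <tr><th>Material</th><td>Steel</td></tr>
  <tr><th>Length</th><td>40 mm</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
name = extract_css(soup, ".product-title")
image_url = extract_css(soup, ".product-image img", attr="src")
specs = extract_table(soup, ".specifications-table")
```

Returning `None` for a missing element lets one page with an absent field still produce a row, which later stages can filter or leave blank.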

Stage 5: Image Downloading & Reporting

```python
import os
from typing import List

import requests


def download_product_images(products: List[dict], output_dir: str):
    image_dir = os.path.join(output_dir, 'Product_Images')
    os.makedirs(image_dir, exist_ok=True)

    for product in products:
        if product.get('image_url'):
            # Images are named after the product and always saved as .jpg
            filename = sanitize_filename(product['name']) + '.jpg'
            filepath = os.path.join(image_dir, filename)

            # Download image
            response = requests.get(product['image_url'], timeout=30)
            if not response.ok:
                continue  # skip images that fail to download
            with open(filepath, 'wb') as f:
                f.write(response.content)
```
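The image downloader in this stage saves every file with a `.jpg` extension. If a source site serves PNG or WebP images, the extension could be taken from the image URL, as in the sketch below. The `image_extension` helper and its list of known extensions are hypothetical additions, not part of the original script.

```python
import os
from urllib.parse import urlparse

# Assumed set of extensions worth preserving; anything else falls back to .jpg.
KNOWN_IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def image_extension(image_url: str, default: str = ".jpg") -> str:
    """Pick a file extension from the URL path, ignoring query strings."""
    path = urlparse(image_url).path  # drops any ?query or #fragment
    ext = os.path.splitext(path)[1].lower()
    return ext if ext in KNOWN_IMAGE_EXTENSIONS else default
```

The file name would then be built as `sanitize_filename(product['name']) + image_extension(product['image_url'])`.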

```python
from typing import List

import pandas as pd


def generate_excel_report(products: List[dict], output_file: str):
    # Remove duplicates based on product name (the first occurrence is kept)
    df = pd.DataFrame(products)
    df = df.drop_duplicates(subset=['name'])

    # Export to Excel
    df.to_excel(output_file, index=False, engine='openpyxl')
```

Project Structure

```
ProductExtractor/
├── input/                         # Source files go here
│   ├── sitemap_list.txt           # Contains sitemap URL
│   └── links.xlsx                 # Excel with product links
│
├── output/                        # Generated automatically
│   └── source_output/
│       ├── Webpages/              # Downloaded HTML files
│       ├── Product_Images/        # Downloaded images
│       └── products.xlsx          # Final Excel report
│
├── scraper.py                     # Main script
└── Instruction.md                 # User documentation
```

Usage

Step 1: Prepare Input

Option A: Sitemap

```
# file: my_sitemap.txt
https://example.com/sitemap.xml
```

Option B: Excel

Place any .xlsx file whose cells contain links in the input/ folder.

Step 2: Run

```bash
python scraper.py
```

Step 3: Results

```
output/
└── anshulimpex_output/
    ├── Webpages/
    │   ├── product-1.html
    │   └── product-2.html
    ├── Product_Images/
    │   ├── product-1.jpg
    │   └── product-2.jpg
    └── anshulimpex_products.xlsx
```

Key Features

Multi-source input: Accepts sitemap URLs or Excel link lists
Automatic deduplication: Removes duplicate products by name
Image downloading: Saves product images with proper naming
Rate limiting: 1-second delay between requests to be polite
Error handling: Continues on individual URL failures
Excel export: Clean formatted spreadsheet output
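The rate-limiting and error-handling features can be pictured as one loop that pauses after every request and records failures without stopping. The sketch below shows that pattern under stated assumptions: `process_urls` and `fake_fetch` are illustrative names, and the real script may structure its loop differently.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple


def process_urls(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay: float = 1.0,
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Fetch each URL, keep going on errors, and pause between requests."""
    results: Dict[str, str] = {}
    failures: List[Tuple[str, str]] = []
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:  # broad on purpose: one bad page must not stop the run
            failures.append((url, str(exc)))
        time.sleep(delay)  # pause after every request, success or failure
    return results, failures


def fake_fetch(url: str) -> str:
    """Stand-in for a real download so the demo runs offline."""
    if "broken" in url:
        raise ValueError("simulated download error")
    return f"<html>{url}</html>"


results, failures = process_urls(
    ["https://example.com/a", "https://example.com/broken", "https://example.com/b"],
    fake_fetch,
    delay=0,  # no waiting in the demo; the pipeline uses 1 second
)
```

Collecting failures in a list, instead of only logging them, makes it possible to retry or report the skipped URLs at the end of a run.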

Dependencies

bash

pip install pandas openpyxl beautifulsoup4 lxml requests

Limitations

Website-specific: CSS selectors are tuned to one site's HTML structure
Anti-scraping: May be blocked by sites with strong protections
HTML changes: Selectors need updating whenever the site is redesigned
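One way to soften the first and last limitations is to keep selectors in data instead of code, so that a redesign or a new site means editing a table. The sketch below assumes a hypothetical `SITE_SELECTORS` mapping keyed by host name. The default entry reuses the selectors from Stage 4, and the `shop.example.com` override is invented.

```python
from typing import Dict
from urllib.parse import urlparse

# Defaults copied from the Stage 4 selector table.
DEFAULT_SELECTORS: Dict[str, str] = {
    "category": ".product-category",
    "name": ".product-title",
    "price": ".price-inr",
    "image_url": ".product-image img",
    "description": ".product-description ul",
    "specifications": ".specifications-table",
}

# Hypothetical per-site overrides; only the fields that differ are listed.
SITE_SELECTORS: Dict[str, Dict[str, str]] = {
    "shop.example.com": {"name": "h1.title", "price": ".amount"},
}


def selectors_for(url: str) -> Dict[str, str]:
    """Merge the defaults with any overrides registered for the URL's host."""
    host = urlparse(url).netloc.lower()
    return {**DEFAULT_SELECTORS, **SITE_SELECTORS.get(host, {})}
```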

Use Cases

Competitor analysis: Extract product catalogs for comparison
Price monitoring: Regular extraction for price tracking
Inventory management: Sync product data to internal systems
Data backup: Create offline copy of online product database

Architecture Feedback

Spotted a potential optimization or antipattern? Let me know.
