# ProductExtractor
An automated data pipeline that transforms a list of website links into a clean, organized product database. Designed for e-commerce product catalog extraction with image downloading and Excel export.
## Architecture

### Pipeline Stages

#### Stage 1: Input Reading
Accepts two types of source files in the input/ folder:
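```python
import os
from typing import List

import pandas as pd

def read_input_files(input_dir: str) -> List[str]:
    """Collect product URLs from every .txt and .xlsx file in input_dir."""
    urls: List[str] = []
    for file in os.listdir(input_dir):
        path = os.path.join(input_dir, file)
        if file.endswith('.txt'):
            # Read the sitemap URL and expand it into page URLs
            with open(path) as f:
                sitemap_url = f.read().strip()
            urls.extend(extract_sitemap(sitemap_url))
        elif file.endswith('.xlsx'):
            # header=None keeps links in the first row as data, not column names
            df = pd.read_excel(path, header=None)
            # Assumes extract_excel_links returns a list of URLs
            urls.extend(extract_excel_links(df))
    return urls
```

- `.txt` files: a single URL pointing to a `sitemap.xml` file
- `.xlsx` files: any cells containing web links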
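
The `extract_excel_links` helper called above is not shown in this README. As a rough sketch of the "any cells containing web links" idea, the function below scans arbitrary cell values for `http(s)` URLs. The name `find_links` and the regular expression are assumptions made for illustration, not the project's actual code.

```python
import re
from typing import Iterable, List

# Hypothetical pattern: any http(s) token up to whitespace, a quote or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

def find_links(cells: Iterable[object]) -> List[str]:
    """Return unique links found in the given cell values, in first-seen order."""
    seen = set()
    links: List[str] = []
    for cell in cells:
        if not isinstance(cell, str):
            continue  # skip numbers, NaN, dates and empty cells
        for match in URL_PATTERN.findall(cell):
            url = match.rstrip(".,;)")  # drop trailing punctuation from prose cells
            if url not in seen:
                seen.add(url)
                links.append(url)
    return links
```

With a pandas DataFrame `df`, one way to feed every cell into this function would be `find_links(df.to_numpy().ravel())`.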
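#### Stage 2: Link Extraction

```python
import xml.etree.ElementTree as ET
from typing import List

import requests

def extract_sitemap(sitemap_url: str) -> List[str]:
    """Return every <loc> URL listed in the sitemap."""
    response = requests.get(sitemap_url, timeout=30)
    response.raise_for_status()
    tree = ET.fromstring(response.content)
    # '{*}' matches any XML namespace (Python 3.8+)
    return [elem.text.strip() for elem in tree.findall('.//{*}loc') if elem.text]
```

#### Stage 3: Web Page Downloading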
Downloads HTML for each product page:
```python
import os
import time

import requests

def download_html(url: str, output_dir: str) -> None:
    response = requests.get(url, timeout=30, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    })
    response.raise_for_status()
    filename = sanitize_filename(url) + '.html'
    webpages_dir = os.path.join(output_dir, 'Webpages')
    os.makedirs(webpages_dir, exist_ok=True)  # create the folder on first use
    filepath = os.path.join(webpages_dir, filename)
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(response.text)
    time.sleep(1)  # Rate limiting
```

This creates an organized folder structure:

```
output/
└── source_name_output/
    └── Webpages/
        ├── product-page-1.html
        └── product-page-2.html
```

#### Stage 4: Data & Image Scraping
The core extraction step parses each saved HTML file into a dictionary of product fields:
```python
from bs4 import BeautifulSoup

def scrape_product_page(html_file: str) -> dict:
    with open(html_file, 'r', encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'lxml')
    # extract_css and extract_table are project helpers not shown here
    product = {
        'category': extract_css(soup, '.product-category'),
        'name': extract_css(soup, '.product-title'),
        'price': extract_css(soup, '.price-inr'),
        'image_url': extract_css(soup, '.product-image img', attr='src'),
        'description': extract_css(soup, '.product-description ul'),
        'specifications': extract_table(soup, '.specifications-table')
    }
    return product
```

| Data Field | CSS Selector Example |
|---|---|
| Category | `.product-category` |
| Name | `.product-title` |
| Price (₹) | `.price-inr` |
| Primary Image | `.product-image img` |
| Description | `.product-description ul` |
| Specifications | `.specifications-table` |
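
Both the page download step and the image download step name their files through `sanitize_filename`, which this README does not define. A minimal sketch of what such a helper might look like follows. The behaviour (last URL path segment, ASCII-only output, a 100-character cap, the `"unnamed"` fallback) is an assumption, not the project's actual implementation.

```python
import re
from urllib.parse import urlparse

def sanitize_filename(text: str, max_length: int = 100) -> str:
    """Turn a URL or product name into a filesystem-safe file stem (hypothetical)."""
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https"):
        # For URLs, keep the last non-empty path segment, falling back to the host.
        segments = [s for s in parsed.path.split("/") if s]
        text = segments[-1] if segments else parsed.netloc
    # Replace each run of characters other than letters, digits, '-' and '_' with '-'.
    stem = re.sub(r"[^A-Za-z0-9_-]+", "-", text).strip("-")
    return stem[:max_length] or "unnamed"
```

Note that two different names can map to the same stem under this scheme (for example, names differing only in punctuation), so a real implementation may want to append a counter or a short hash.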
#### Stage 5: Image Downloading & Reporting

```python
import os
from typing import List

import requests

def download_product_images(products: List[dict], output_dir: str) -> None:
    image_dir = os.path.join(output_dir, 'Product_Images')
    os.makedirs(image_dir, exist_ok=True)
    for product in products:
        if product['image_url']:
            # The .jpg extension is fixed; it is not derived from the response
            filename = sanitize_filename(product['name']) + '.jpg'
            filepath = os.path.join(image_dir, filename)
            # Download image
            response = requests.get(product['image_url'], timeout=30)
            response.raise_for_status()
            with open(filepath, 'wb') as f:
                f.write(response.content)
```

```python
from typing import List

import pandas as pd

def generate_excel_report(products: List[dict], output_file: str) -> None:
    # Remove duplicates based on product name
    df = pd.DataFrame(products)
    df = df.drop_duplicates(subset=['name'])
    # Export to Excel
    df.to_excel(output_file, index=False, engine='openpyxl')
```

## Project Structure
```
ProductExtractor/
├── input/                      # Source files go here
│   ├── sitemap_list.txt        # Contains sitemap URL
│   └── links.xlsx              # Excel with product links
│
├── output/                     # Generated automatically
│   └── source_output/
│       ├── Webpages/           # Downloaded HTML files
│       ├── Product_Images/     # Downloaded images
│       └── products.xlsx       # Final Excel report
│
├── scraper.py                  # Main script
└── Instruction.md              # User documentation
```

## Usage
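### Step 1: Prepare Input

Option A: Sitemap. Create a `.txt` file in `input/` (for example `my_sitemap.txt`) whose only content is the sitemap URL:

```
https://example.com/sitemap.xml
```

Option B: Excel. Place any `.xlsx` file with links in its cells in `input/`.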
### Step 2: Run

```bash
python scraper.py
```

### Step 3: Results
```
output/
├── anshulimpex_output/
│   ├── Webpages/
│   │   ├── product-1.html
│   │   └── product-2.html
│   ├── Product_Images/
│   │   ├── product-1.jpg
│   │   └── product-2.jpg
│   └── anshulimpex_products.xlsx
```

## Key Features
- Multi-source input: Accepts sitemap URLs or Excel link lists
- Automatic deduplication: Removes duplicate products by name
- Image downloading: Saves product images with proper naming
- Rate limiting: 1-second delay between requests to be polite
- Error handling: Continues on individual URL failures
- Excel export: Clean formatted spreadsheet output
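
The rate-limiting and error-handling features above could be combined in a single driver loop. The sketch below is an illustration under assumptions: `process_urls` and its `handler` parameter are hypothetical names, and the real `scraper.py` may organize this differently. The handler would typically be something like `download_html` with the output directory bound in.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple

def process_urls(
    urls: Iterable[str],
    handler: Callable[[str], None],
    delay_seconds: float = 1.0,
) -> Tuple[List[str], Dict[str, str]]:
    """Call handler(url) for each URL, pausing between calls and recording failures."""
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause between requests, not before the first
        try:
            handler(url)
            succeeded.append(url)
        except Exception as exc:  # broad on purpose: one bad URL must not stop the run
            failed[url] = f"{type(exc).__name__}: {exc}"
    return succeeded, failed
```

Returning the failures rather than only logging them makes it possible to retry just those URLs on a later run. If the handler already sleeps after each request, as `download_html` does, `delay_seconds` can be set to `0` to avoid doubling the delay.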
## Dependencies

```bash
pip install pandas openpyxl beautifulsoup4 lxml requests
```

## Limitations
- Website-specific: CSS selectors are tuned for one site's HTML structure
- Anti-scraping: May be blocked by sites with strong bot protections
- HTML changes: Selectors need updating whenever the site is redesigned
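
One possible way to soften the selector limitations is to keep selectors in a per-site table rather than hard-coding them in the scraping function. The sketch below assumes this design; `SITE_OVERRIDES`, `selectors_for`, and the `example.org` entry are hypothetical, while the default selectors are the ones listed in Stage 4.

```python
from typing import Dict
from urllib.parse import urlparse

# Defaults taken from the selector table in Stage 4.
DEFAULT_SELECTORS: Dict[str, str] = {
    "category": ".product-category",
    "name": ".product-title",
    "price": ".price-inr",
    "image_url": ".product-image img",
    "description": ".product-description ul",
    "specifications": ".specifications-table",
}

# Hypothetical overrides keyed by host name; example.org is a placeholder.
SITE_OVERRIDES: Dict[str, Dict[str, str]] = {
    "example.org": {"name": "h1.title", "price": ".amount"},
}

def selectors_for(url: str) -> Dict[str, str]:
    """Return the default selectors with any overrides for the URL's host applied."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return {**DEFAULT_SELECTORS, **SITE_OVERRIDES.get(host, {})}
```

With this layout, a site redesign means editing one dictionary entry instead of the scraping code itself.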
## Use Cases
- Competitor analysis: Extract product catalogs for comparison
- Price monitoring: Regular extraction for price tracking
- Inventory management: Sync product data to internal systems
- Data backup: Create offline copy of online product database
## Architecture Feedback
Spotted a potential optimization or antipattern? Let me know.