from docx import Document
import re

# Match common email address forms; note the escaped dot before the TLD.
email_re = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')

def extract_from_docx(path):
    doc = Document(path)
    texts = []
    # Body paragraphs
    for p in doc.paragraphs:
        texts.append(p.text)
    # Table cells
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                texts.append(cell.text)
    matches = set()
    for t in texts:
        matches.update(email_re.findall(t))
    return matches
Pros: Flexible, cross-platform, reaches most parts of a .docx. Cons: Extra work is needed for .doc files or embedded objects.
Option C — VBA macro inside Word
For users comfortable with Word macros, VBA can loop through open documents or files in a folder and write found addresses to a new document or CSV. VBA can access headers/footers and comments but requires enabling macros.
Extracting from older .doc files and embedded content
- Convert .doc files to .docx first (Word can batch-convert or save-as). Conversion makes parsing much simpler; see the conversion sketch after this list.
- Embedded objects (e.g., embedded emails in Outlook items, text inside images) require special handling:
- For images: OCR (Tesseract or cloud OCR).
- For embedded Outlook items: save them out then parse with appropriate tools.
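As a minimal sketch of the conversion step, the snippet below shells out to LibreOffice in headless mode to batch-convert .doc files to .docx. It assumes LibreOffice is installed and the soffice binary is on your PATH; on Windows you could instead drive Word itself through pywin32.

import subprocess
from pathlib import Path

def convert_doc_folder(folder, out_folder='converted'):
    """Batch-convert .doc files to .docx using LibreOffice in headless mode."""
    Path(out_folder).mkdir(exist_ok=True)
    for doc_path in Path(folder).glob('*.doc'):
        # Assumes 'soffice' (LibreOffice) is available on PATH.
        subprocess.run(
            ['soffice', '--headless', '--convert-to', 'docx',
             '--outdir', out_folder, str(doc_path)],
            check=True,
        )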
Cleaning, validating, and deduplicating
- Use a consistent regex for extraction; consider edge cases (subdomains, plus addressing, internationalized domains).
- Normalize addresses (lowercase them) before deduplication; a short cleanup sketch follows this list.
- Validate syntax with stricter patterns or libraries.
- Verify deliverability with SMTP checks or validation services (respect rate limits and legal constraints).
- Filter out role addresses (info@, postmaster@) if needed.
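As a hedged sketch of that cleanup step, the helper below lowercases, validates, deduplicates, and optionally drops common role addresses. The stricter validation pattern and the role-prefix list are illustrative choices, not a standard; swap in a library such as email-validator if you need real RFC-level checks.

import re

# Somewhat stricter than the extraction regex, but still not full RFC 5322.
VALID_RE = re.compile(r'^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$')
ROLE_PREFIXES = {'info', 'postmaster', 'admin', 'noreply', 'no-reply'}  # illustrative list

def clean_emails(raw_emails, drop_role_addresses=False):
    cleaned = set()
    for email in raw_emails:
        email = email.strip().strip('.').lower()  # normalize case, drop stray trailing dots
        if not VALID_RE.match(email):
            continue
        local_part = email.split('@', 1)[0]
        if drop_role_addresses and local_part in ROLE_PREFIXES:
            continue
        cleaned.add(email)
    return sorted(cleaned)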
Sample Python script (full example)
# save as extract_emails_from_docx.py
from docx import Document
import re, os, csv

email_re = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}', re.I)

def extract_from_docx(path):
    doc = Document(path)
    texts = []
    for p in doc.paragraphs:
        texts.append(p.text)
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                texts.append(cell.text)
    # Headers/footers/comments require opening the package with zipfile and parsing the XML if needed.
    matches = set()
    for t in texts:
        matches.update(email_re.findall(t))
    return matches

def main(folder, out_csv='emails.csv'):
    all_emails = set()
    for filename in os.listdir(folder):
        if filename.lower().endswith('.docx'):
            path = os.path.join(folder, filename)
            all_emails.update(extract_from_docx(path))
    with open(out_csv, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['email'])
        for e in sorted(all_emails):
            writer.writerow([e])
    print(f'Found {len(all_emails)} unique emails. Saved to {out_csv}')

if __name__ == '__main__':
    import sys
    folder = sys.argv[1] if len(sys.argv) > 1 else '.'
    main(folder)
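To run it against a folder of .docx files (the script defaults to the current directory when no argument is given):

python extract_emails_from_docx.py /path/to/folder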
Troubleshooting common issues
- Missing emails from headers/footers/comments: parse the raw XML parts inside the .docx package (word/header*.xml, word/footer*.xml, word/comments.xml); see the sketch after this list.
- .doc files: convert to .docx or use the pywin32 COM interface on Windows to extract text.
- False positives: refine regex or post-filter domains.
- Encoding problems: ensure UTF-8 handling for output.
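As a minimal sketch of that raw-XML approach (a .docx is just a ZIP archive), the snippet below runs the same email regex over the header, footer, and comments parts without a full XML parse. This is a quick shortcut rather than the python-docx API, so treat the tag-stripping step as an assumption about how much structure you can safely ignore.

import re
import zipfile

email_re = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}', re.I)

def extract_from_docx_parts(path):
    """Scan header, footer, and comments XML parts inside the .docx ZIP for emails."""
    matches = set()
    with zipfile.ZipFile(path) as z:
        for name in z.namelist():
            if re.match(r'word/(header\d*|footer\d*|comments)\.xml$', name):
                xml_text = z.read(name).decode('utf-8', errors='ignore')
                # Strip XML tags so an address split across text runs can still match.
                plain = re.sub(r'<[^>]+>', '', xml_text)
                matches.update(email_re.findall(plain))
    return matches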
Responsible use and next steps
- Store extracted emails securely (encrypted storage if sensitive).
- Respect unsubscribe and privacy laws when contacting.
- Consider enriching lists with consented sources or opt-in methods rather than mass-scraping.
If you’d like, I can:
- Provide a ready-to-run PowerShell script that extracts from .docx and headers/footers.
- Expand the Python example to parse headers/footers/comments and .doc via conversion.
- Create a VBA macro to run inside Word and export to CSV.