Bulk Extract E-mails from MS Word Documents for Marketing Lists

from docx import Document import re, os email_re = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+.[A-Za-z]{2,}') def extract_from_docx(path):     doc = Document(path)     texts = []     for p in doc.paragraphs:         texts.append(p.text)     for table in doc.tables:         for row in table.rows:             for cell in row.cells:                 texts.append(cell.text)     matches = set()     for t in texts:         matches.update(email_re.findall(t))     return matches 

Pros: Flexible, cross-platform, handles many parts of a .docx. Cons: Additional work for .doc or embedded objects.

Option C — VBA macro inside Word

For users comfortable with Word macros, VBA can loop through open documents or files in a folder and write found addresses to a new document or CSV. VBA can access headers/footers and comments but requires enabling macros.


Extracting from older .doc files and embedded content

  • Convert .doc to .docx first (Word can batch-convert or save-as). Conversion makes parsing simpler.
  • Embedded objects (e.g., embedded emails in Outlook items, text inside images) require special handling:
    • For images: OCR (Tesseract or cloud OCR).
    • For embedded Outlook items: save them out then parse with appropriate tools.

Cleaning, validating, and deduplicating

  1. Use a consistent regex for extraction; consider edge cases (subdomains, plus addressing, internationalized domains).
  2. Normalize (lowercase) addresses before deduplication.
  3. Validate syntax with stricter patterns or libraries.
  4. Verify deliverability with SMTP checks or validation services (respect rate limits and legal constraints).
  5. Filter out role addresses (info@, postmaster@) if needed.

Sample Python script (full example)

# save as extract_emails_from_docx.py from docx import Document import re, os, csv email_re = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+.[A-Za-z]{2,}', re.I) def extract_from_docx(path):     doc = Document(path)     texts = []     for p in doc.paragraphs:         texts.append(p.text)     for table in doc.tables:         for row in table.rows:             for cell in row.cells:                 texts.append(cell.text)     # headers/footers/comments require opening package with zipfile + parsing xml if needed     matches = set()     for t in texts:         matches.update(email_re.findall(t))     return matches def main(folder, out_csv='emails.csv'):     all_emails = set()     for filename in os.listdir(folder):         if filename.lower().endswith('.docx'):             path = os.path.join(folder, filename)             all_emails.update(extract_from_docx(path))     with open(out_csv, 'w', newline='', encoding='utf-8') as f:         writer = csv.writer(f)         writer.writerow(['email'])         for e in sorted(all_emails):             writer.writerow([e])     print(f'Found {len(all_emails)} unique emails. Saved to {out_csv}') if __name__ == '__main__':     import sys     folder = sys.argv[1] if len(sys.argv)>1 else '.'     main(folder) 

Troubleshooting common issues

  • Missing emails from headers/footers/comments: parse the raw XML parts in .docx (word/header.xml, word/footer.xml, word/comments.xml).
  • .doc files: convert to .docx or use the pywin32 COM interface on Windows to extract text.
  • False positives: refine regex or post-filter domains.
  • Encoding problems: ensure UTF-8 handling for output.

Responsible use and next steps

  • Store extracted emails securely (encrypted storage if sensitive).
  • Respect unsubscribe and privacy laws when contacting.
  • Consider enriching lists with consented sources or opt-in methods rather than mass-scraping.

If you’d like, I can:

  • Provide a ready-to-run PowerShell script that extracts from .docx and headers/footers.
  • Expand the Python example to parse headers/footers/comments and .doc via conversion.
  • Create a VBA macro to run inside Word and export to CSV.

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *