Clean Extra Spaces from Copied Text
Updated: May 2026
Copying text from Word, PDF readers, spreadsheets, or web pages often introduces formatting artifacts: extra spaces, non-breaking spaces, paragraph padding, and tab characters that were not visible in the original document but survive the paste into plain-text environments.
Free · No upload · Works for any source
Why copying introduces extra spaces
Rich-text formats like Word, PDF, and HTML represent spacing in multiple ways. Word uses paragraph spacing properties, font-level kerning, and justification algorithms, none of which are stored as characters in the text. When you copy text from Word and paste it into a plain-text field, the application must convert this formatting into explicit characters, and the conversion is lossy in both directions: some spacing is lost, and some is over-represented as literal space characters.
PDFs do not store text as a linear stream. Each character has an absolute X/Y position on the page. When a PDF viewer extracts text for copying, it must infer word boundaries from the gaps between character positions. A gap slightly wider than the font's natural space width may be interpreted as two spaces. Across a full-page PDF, these ambiguous gaps can add up to many double spaces in the extracted text.
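The gap inference described above can be sketched roughly as follows. This is an illustration only, not how any particular PDF viewer works: the `Glyph` tuple, the `extract_line` function, and the 0.5x and 1.5x thresholds are all assumptions chosen to show the idea.

```python
from typing import List, Tuple

# Hypothetical glyph record: (character, x_start, x_end) in page units.
Glyph = Tuple[str, float, float]

def extract_line(glyphs: List[Glyph], space_width: float) -> str:
    """Rebuild one line of text by guessing spaces from horizontal gaps."""
    out = []
    prev_end = None
    for ch, x_start, x_end in glyphs:
        if prev_end is not None:
            gap = x_start - prev_end
            if gap > 1.5 * space_width:
                out.append("  ")  # wide gap: emitted as two spaces (assumed threshold)
            elif gap > 0.5 * space_width:
                out.append(" ")   # ordinary word gap
            # smaller gaps are treated as normal letter spacing
        out.append(ch)
        prev_end = x_end
    return "".join(out)

# Illustrative positions: a tight gap, a normal word gap, and a wide gap.
glyphs = [("a", 0.0, 5.0), ("b", 5.2, 10.0), ("c", 13.0, 18.0), ("d", 26.5, 31.0)]
line = extract_line(glyphs, space_width=5.0)
```

With these made-up positions the wide gap before the last glyph comes out as a double space, which is the kind of artifact the collapse option removes.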
Non-breaking spaces (Unicode U+00A0) are common in text copied from web pages. They look like regular spaces but don't behave like them: they don't wrap at line boundaries, don't collapse in HTML, and don't match a literal space character in a regular expression. This tool converts them to standard spaces as part of the collapse operation.
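A small sketch of why non-breaking spaces slip past a literal-space pattern, and one way to normalize them before collapsing. The function name `normalize_spaces` is hypothetical, and the tool's actual implementation may differ.

```python
import re

NBSP = "\u00a0"
text = f"Figure{NBSP}3 shows  the{NBSP}{NBSP}result"

# A pattern built from literal spaces only finds the run of regular spaces;
# the run of two non-breaking spaces is not matched.
literal_matches = re.findall(r" {2,}", text)

def normalize_spaces(s: str) -> str:
    """Convert non-breaking spaces to regular spaces, then collapse runs."""
    s = s.replace(NBSP, " ")
    return re.sub(r" {2,}", " ", s)

cleaned = normalize_spaces(text)
```

Replacing U+00A0 first matters: collapsing before the conversion would leave the non-breaking runs untouched.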
Artifacts by source type
- Microsoft Word — double spaces after periods (typing habit or autocorrect), non-breaking spaces in citations, tab characters between columns in manually formatted tables, and paragraph marks that become blank lines.
- Google Docs — non-breaking spaces inserted by smart paste, extra spaces around hyperlinks, trailing spaces on lines with right-aligned content.
- PDF documents — double spaces from word-gap interpretation, hyphenated words split across lines appearing as "hyp- henated" with a space, and page numbers interrupting paragraphs.
- Spreadsheets — trailing spaces from fixed-width column alignment, tab characters as cell separators, leading spaces in cells formatted as text.
- Web pages — non-breaking spaces used for layout, multiple spaces used as visual indentation in HTML source that survived rendering, and spaces around inline code elements.
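To see which of the artifacts listed above a pasted sample actually contains, a quick count can help before choosing options. The following is a minimal sketch; `count_artifacts` and its category names are illustrative and not part of the tool.

```python
import re
from typing import Dict

def count_artifacts(text: str) -> Dict[str, int]:
    """Count common whitespace artifacts in pasted text."""
    lines = text.splitlines()
    return {
        "non_breaking_spaces": text.count("\u00a0"),
        "tabs": text.count("\t"),
        # runs of two or more regular spaces
        "space_runs": len(re.findall(r" {2,}", text)),
        "lines_with_trailing_space": sum(1 for ln in lines if ln != ln.rstrip(" ")),
        "lines_with_leading_space": sum(1 for ln in lines if ln != ln.lstrip(" ")),
    }

sample = "Total:\u00a042  items \n\tcol A\tcol B\n  indented"
report = count_artifacts(sample)
```

A high tab count suggests spreadsheet or table content, while many space runs with no tabs is more typical of PDF extraction.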
Recommended cleaning options by source
For text copied from Word or Google Docs: enable Collapse multiple spaces and Trim each line. This handles the most common artifacts. Enable Limit blank lines if the source had many section breaks.
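For readers who want to reproduce these three options in a script, here is a rough equivalent. It is a sketch under assumptions: the function name, the `max_blank_lines` parameter, and the exact order of operations are illustrative and may not match the tool's behavior.

```python
import re

def clean_document_text(text: str, max_blank_lines: int = 1) -> str:
    """Collapse space runs, trim each line, and limit consecutive blank lines."""
    cleaned_lines = []
    for line in text.splitlines():
        line = line.replace("\u00a0", " ")   # treat non-breaking spaces as spaces
        line = re.sub(r" {2,}", " ", line)   # collapse multiple spaces
        cleaned_lines.append(line.strip())   # trim each line

    result = []
    blank_run = 0
    for line in cleaned_lines:
        if line == "":
            blank_run += 1
            if blank_run > max_blank_lines:
                continue                     # drop blank lines beyond the limit
        else:
            blank_run = 0
        result.append(line)
    return "\n".join(result)

sample = "First sentence.  Second sentence.   \n\n\n\n  Next paragraph.\n"
cleaned = clean_document_text(sample)
```

Trimming before limiting blank lines means that lines containing only spaces count as blank, which is usually what is wanted for pasted documents.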
For text extracted from PDFs: enable Collapse multiple spaces, which is essential for double-space gaps. If paragraph breaks were lost during extraction, the text will appear as one long run; re-add the paragraph breaks manually first, then use the blank line removal option to tidy any extra blank lines.
For data from spreadsheets: enable Trim each line to remove padding. If tabs separate columns and you want them as spaces, enable Convert tabs to spaces. If you are importing into a database, trimming is the single most important step, because leading or trailing spaces make otherwise identical values compare as different.
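A scripted variant for spreadsheet data might look like the sketch below. Note one assumption that goes beyond the options described above: it trims every cell, not only the ends of each line, since padding inside a row is what usually causes trouble on database import. The name `clean_spreadsheet_rows` and its `tabs_to_spaces` flag are hypothetical.

```python
from typing import List

def clean_spreadsheet_rows(text: str, tabs_to_spaces: bool = False) -> List[str]:
    """Trim padding from each tab-separated cell; optionally join cells with spaces."""
    separator = " " if tabs_to_spaces else "\t"
    rows = []
    for line in text.splitlines():
        # Strip regular and non-breaking spaces, but keep tabs as cell boundaries.
        cells = [cell.strip(" \u00a0") for cell in line.split("\t")]
        rows.append(separator.join(cells))
    return rows

sample = "  Alice \t42  \n Bob\t  7\n"
rows_tabbed = clean_spreadsheet_rows(sample)
rows_spaced = clean_spreadsheet_rows(sample, tabs_to_spaces=True)
```

Keeping the tabs by default preserves the column structure, so the rows can still be split reliably after cleaning.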
Frequently asked questions
Why does my Word document paste with so many extra spaces?
Word uses non-breaking spaces for certain typographic cases (before colons in French, inside citations, around numbers with units) and may also insert double spaces after periods via autocorrect. The pasted text preserves these as literal characters.
Can this tool fix hyphenated words broken across PDF lines?
No — rejoining hyphenated words requires understanding which hyphens are soft (line-break) and which are hard (part of the word). This tool only handles whitespace. Manual editing is needed for hyphenation artifacts.
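A short sketch of why a purely mechanical fix is unreliable. The function `naive_dehyphenate` is hypothetical and is shown only to illustrate the ambiguity, not as a recommended approach.

```python
import re

def naive_dehyphenate(text: str) -> str:
    """Join any 'word- word' pair by removing the hyphen and the space."""
    return re.sub(r"(\w)- (\w)", r"\1\2", text)

# A soft (line-break) hyphen: the naive rule gives the right word.
soft_case = naive_dehyphenate("hyp- henated")

# A hard hyphen that happened to fall at a line end: the naive rule
# removes a hyphen that belongs to the word.
hard_case = naive_dehyphenate("well- known")
```

The first result is the intended word, but the second turns "well-known" into "wellknown". Telling the two cases apart takes a dictionary or a human reader, which is outside the scope of whitespace cleaning.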
My copied text has symbols like � instead of real characters. Can this fix that?
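No. Those are encoding errors: � is the Unicode replacement character (U+FFFD), shown when bytes cannot be decoded, typically because Latin-1 text was interpreted as UTF-8. Whitespace cleaning does not address character encoding. Use a dedicated encoding converter on the original source before cleaning spaces.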
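The mismatch can be reproduced in a few lines. This is a minimal demonstration using the sample word "café"; it shows how the symbol arises, not a repair procedure for already-pasted text.

```python
original = "café"

# The text is stored as Latin-1 bytes: the "é" becomes the single byte 0xE9.
latin1_bytes = original.encode("latin-1")

# Decoding those bytes as UTF-8 fails on 0xE9; with errors="replace",
# the undecodable byte is shown as U+FFFD.
garbled = latin1_bytes.decode("utf-8", errors="replace")

# Decoding the same bytes with the correct encoding recovers the text.
recovered = latin1_bytes.decode("latin-1")
```

Once the replacement character is in the pasted text, the original byte is gone, so the fix has to start from the original file or page and the correct encoding.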