Toolverse

CSV Parsing Done Right: RFC 4180, Edge Cases, and Why Split-on-Comma Fails

CSV looks trivial. Comma-separated values, one row per line, done. Then a customer uploads a file with a name like “Smith, Jr.” or a product description with an embedded newline, and your naive split-on-comma parser silently corrupts half the data. CSV only looks simple — getting it right means respecting RFC 4180 and a pile of real-world quirks the RFC does not cover.

What RFC 4180 Actually Specifies

RFC 4180, published in 2005, is the closest thing CSV has to a standard. It defines three core rules:

  • Fields are separated by commas. The record separator is CRLF. Trailing CRLF on the last record is optional.
  • Fields may be enclosed in double quotes. If a field contains a comma, a CRLF, or a double quote, it must be quoted.
  • Double quotes inside a quoted field are escaped by doubling them. So "She said ""hi""" decodes to She said "hi".

That's it. The RFC says nothing about delimiters other than comma, nothing about encoding, nothing about how to handle rows of uneven length. Every real-world CSV consumer has to decide.
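The three rules above are exactly what Python's standard-library csv module implements, so a quick sketch can confirm them — here a quoted field containing a comma, and doubled quotes decoding to a literal quote:

```python
import csv
import io

# RFC 4180 sample: CRLF record separator, a quoted field with an
# embedded comma, and "" decoding to a literal double quote.
raw = 'name,quote\r\n"Smith, Jr.","She said ""hi"""\r\n'

rows = list(csv.reader(io.StringIO(raw)))
print(rows[1])  # ['Smith, Jr.', 'She said "hi"']
```

The comma survives inside the first field and the doubled quotes collapse to one, exactly as the RFC prescribes.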

Edge Cases That Break Naive Parsers

  • Quoted fields with embedded newlines. "Hello\nWorld" is a single field spanning two text lines. Splitting on \n before respecting quotes turns one row into two.
  • Quotes inside unquoted fields. RFC 4180 says the double quote inside an unquoted field is undefined, but Excel and Google Sheets both emit them literally. A strict parser will reject what most tools produce; a permissive parser passes the raw bytes through.
  • Whitespace around fields. "Ada" , 36 — is the whitespace part of the value, or is it a courtesy? The RFC leaves this ambiguous. Spreadsheet exports usually strip it; programmatic CSVs may not.
  • Byte-order marks. Excel on Windows prepends a UTF-8 BOM (EF BB BF) to exported CSVs. If you do not strip it, the first column name becomes \ufeffid instead of id, and every keyed lookup silently fails.
  • Trailing commas on empty rows. Some tools emit a,,,\n to preserve column count; others emit a\n. Parsers must decide whether a short row means nulls or missing columns.
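The first edge case is easy to demonstrate. A minimal comparison, using Python's csv module as the compliant reader: splitting on newlines before respecting quotes breaks one record in two, while a real parser keeps the multi-line field intact.

```python
import csv
import io

raw = 'id,note\r\n1,"Hello\nWorld"\r\n'

# Naive approach: split on newlines, then on commas.
# The quoted newline turns one record into two broken ones.
naive = [line.split(',') for line in raw.splitlines() if line]

# A compliant reader keeps the multi-line quoted field intact.
proper = list(csv.reader(io.StringIO(raw)))

print(len(naive))  # 3 "rows" instead of 2
print(proper[1])   # ['1', 'Hello\nWorld']
```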

Delimiters: Comma Is a Polite Default, Not a Rule

European locales export CSVs with semicolons, because the comma is the decimal separator. TSV files use tab. PostgreSQL's COPY command defaults to tab. Excel on a German machine produces semicolons. A CSV library that assumes commas will mangle any of these on the first try.

Good parsers either expose a delimiter option or auto-detect by sniffing the first line. Python's csv.Sniffer scans the sample for the most frequent candidate among , ; \t | and picks the one with the most consistent row widths — a practical heuristic that handles 95% of real inputs.
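Sniffing is a one-liner in practice. A short example with a semicolon-delimited sample (the restricted delimiter set passed here is an explicit choice, not the Sniffer's default):

```python
import csv

# A semicolon-delimited export, as produced by Excel in many
# European locales.
sample = 'id;name;city\n1;Ada;London\n2;Grace;Arlington\n'

# Restrict candidates to the usual suspects and let the Sniffer pick.
dialect = csv.Sniffer().sniff(sample, delimiters=',;\t|')
print(dialect.delimiter)  # ';'
```

The returned dialect can be passed straight to csv.reader(f, dialect) so the rest of the pipeline never hard-codes a delimiter.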

Type Coercion Is a Trap — Usually

Most CSV-to-JSON tools offer to convert 42 to a number and true to a boolean. That helps with clean data and silently destroys identifier columns. Leading-zero identifiers (zip codes, phone numbers, product SKUs) become truncated integers: 007 is now the number 7, not the zero-padded string you needed. Dates like 03/04/2026 can even be misread as a division expression.

The rule: coerce only when every cell in the column passes the same type test, and always expose a toggle to keep everything as strings. JSON Schema validators downstream can re-coerce with stricter rules than a generic “looks like a number” regex.
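As a sketch of that rule — coerce_column and is_safe_int are hypothetical helpers, not from any library — coerce a column to integers only when every cell passes the same test and no value would lose a leading zero, with a toggle to keep everything as strings:

```python
def coerce_column(values, coerce=True):
    """Convert a column of strings to int only if every cell passes
    the same test and no value would lose a leading zero."""
    if not coerce:
        return values  # the toggle: keep everything as strings

    def is_safe_int(s):
        # All digits, and no zero-padding (except the literal "0").
        return s.isdigit() and (s == '0' or not s.startswith('0'))

    if all(is_safe_int(v) for v in values):
        return [int(v) for v in values]
    return values  # mixed or zero-padded: keep strings

print(coerce_column(['42', '7']))         # [42, 7]
print(coerce_column(['03401', '07005']))  # kept as strings
```

Because the test is applied column-wide, a single zero-padded SKU is enough to keep the whole column textual — exactly the conservative behavior the paragraph above argues for.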

Encoding: The Silent Data Destroyer

CSV has no self-describing encoding. A file that was UTF-8 when exported from PostgreSQL becomes Windows-1252 after a round trip through Excel, which then re-saves it as UTF-8 with a BOM. Characters like é and € degrade to mojibake (Ã©, â‚¬) at each conversion step.

When receiving CSV from an unknown source, always try UTF-8 first, fall back to Latin-1 (which always succeeds but may produce garbled accents), and flag non-ASCII bytes. Better: require the sender to confirm the encoding out of band, or convert through a format with explicit encoding metadata like Parquet.
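A minimal version of that fallback chain, using Python's built-in codecs (utf-8-sig strips the Excel BOM for free; read_csv_bytes is an illustrative name, not a library function):

```python
def read_csv_bytes(data: bytes) -> str:
    """Decode CSV bytes of unknown encoding: UTF-8 first (stripping a
    BOM if present), then Latin-1, which never fails but may garble
    accented characters."""
    for codec in ('utf-8-sig', 'utf-8'):
        try:
            return data.decode(codec)
        except UnicodeDecodeError:
            pass
    return data.decode('latin-1')  # always succeeds, byte-for-byte

print(read_csv_bytes('café'.encode('utf-8')))    # café
print(read_csv_bytes('café'.encode('latin-1')))  # café, via fallback
```

In production you would also flag which branch was taken, so a Latin-1 fallback on supposedly-UTF-8 data surfaces as a warning rather than silently garbled accents.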

When CSV Is and Is Not the Right Format

CSV shines for tabular data that humans will inspect in a spreadsheet — flat, narrow, predictable columns. It is excellent for data interchange between heterogeneous systems precisely because every tool reads it, even if imperfectly.

CSV fails for nested data (no arrays or objects), binary payloads (needs Base64 encoding), and streams where schema evolves. For those, JSON Lines, Parquet, or Arrow are better. A rule of thumb: if your data fits cleanly in a 2D spreadsheet with sub-256 columns, CSV is fine. Beyond that, use JSON.

Key Takeaways

  • RFC 4180 covers quotes, escapes, and line endings — nothing else. Delimiter, encoding, and ragged rows are your problem.
  • A compliant parser handles quoted newlines, doubled quotes, and the UTF-8 BOM. A non-compliant one silently corrupts data.
  • Type coercion is optional — always expose a toggle so leading-zero IDs and phone numbers survive.
  • When you ship CSV to a partner, document the encoding, delimiter, and line-ending conventions. Assume nothing will be guessed correctly.

Need to convert a pasted CSV to JSON right now, with proper quoting and optional type coercion? Try our CSV to JSON converter — runs in your browser, no upload.

Frequently Asked Questions

What does RFC 4180 actually standardize?
RFC 4180 defines only three things: comma as the field separator, CRLF as the record separator, and double-quote escaping via doubled quotes. It does not standardize delimiters other than comma, character encoding, or how to handle rows of uneven length — those decisions are left to each implementation.
Why does my CSV lose leading zeros when I convert it to JSON?
Aggressive type coercion converts strings that look like numbers into integers, so zip code '03401' becomes 3401 and SKU '007' becomes 7. Any production-grade CSV tool should let you disable coercion or toggle it per-column to preserve leading-zero identifiers.
Is CSV still appropriate in 2026?
Yes, for flat tabular data exchanged between heterogeneous systems. CSV is universally readable by spreadsheets and databases. It breaks down for nested structures, binary payloads, and evolving schemas — for those, JSON Lines or Parquet are better choices. Use CSV when your data fits a 2D table with stable columns.