decimal point character, commodity directives, parsing, streaming

Discussion:

Simon Michael

2018-05-21 21:06:20 UTC

At https://github.com/simonmichael/hledger/issues/698, https://github.com/simonmichael/hledger/issues/688 and similar we are discussing issues around number parsing in journal format. Things which seem simple on the surface are proving quite slippery and intertwined. Some decisions are needed, so here are some thoughts.

Some goals:

Simplicity: minimise complexity in implementation, usage, learning.

I18n (internationalisation): support all major local number notations equally well. Don't be us/anglo-centric.

Correctness: minimise the chance of overlooked wrong results due to variations in number format. Assist with error detection as much as possible, at least in strict mode.

Predictability: be intuitive and avoid surprising users. Produce the same results from the same data every time, independent of file ordering, current locale, etc.

Flexibility: allow full control of how numbers are parsed (ie: choice of decimal point character), eg per commodity and per file.

Convenience: minimise boilerplate and configuration effort for common cases. Do the most right thing by default.

On i18n and flexibility:

When displaying output, we want to be able to control the decimal point character (at least period and comma) and digit grouping (none, comma, period, space, arbitrary group sizes..), for i18n reasons. People use different notations in different parts of the world. (https://en.wikipedia.org/wiki/Decimal_separator)

Currently this is possible on a per-commodity basis. Eg you can display one commodity with the symbol on the left and no decimal places; and another with the symbol on the right and 4 decimal places. This is useful.

You can even display one commodity with a comma decimal point and no digit groups, and another with a period decimal point and comma-separated lakhs/crores/thousand groups. This might be useful, neutral or anti-useful, I'm not sure. It allows mixed decimal point/digit grouping conventions in the output of a single report.

When parsing, we need to know the identity of the decimal point character used in the data (period or comma), in order to correctly parse ambiguous numbers like 1,000 or 1.000. This is independent of the decimal point character in output. I think this should be allowed to vary from file to file (maybe you are aggregating data from multiple countries.)
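To make the ambiguity concrete, here is a minimal Python sketch. It is not hledger's actual (Haskell) parser; `parse_number` is a hypothetical helper that assumes only period and comma can be the decimal point, and that spaces are digit group separators.

```python
from decimal import Decimal

def parse_number(text: str, decimal_mark: str) -> Decimal:
    """Parse text, treating decimal_mark as the decimal point and the
    other of '.'/',' (plus spaces) as digit group separators."""
    if decimal_mark not in (".", ","):
        raise ValueError("decimal mark must be '.' or ','")
    group_mark = "," if decimal_mark == "." else "."
    cleaned = text.replace(group_mark, "").replace(" ", "")
    return Decimal(cleaned.replace(decimal_mark, "."))

# The same text gives two different values, depending on the file's notation.
print(parse_number("1,000", "."))  # 1000
print(parse_number("1,000", ","))  # 1.000
```

The decimal mark is an explicit argument here precisely because it cannot be recovered from a number like 1,000 alone.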

We have been trying to express all of the above (plus the existence of commodities) with commodity (and D) directives, but in their present form I think they are not quite expressive enough.

Streaming ?

A somewhat related topic is, do we care about parsing data in a streaming fashion, ie doing as much work as possible as we go, rather than accumulating it till the end ? I am inclined to think we should prefer streaming-friendly designs and implementations when we have a choice. This is more possible when directives and journal entries are streaming-friendly.

Expanding on the discussion at http://hledger.org/manual.html#directive-scope-multiple-files, here are some different kinds of directive or configuration we could name:

- "file-global": the directive affects the whole file it appears in, regardless of its position in the file. Also any included subfiles. (I'll consider "file" to mean "file and its subfiles".)

- "file-chronological": the directive has a date and affects the journal entries in this file (and subfiles) that are dated after it, regardless of its position.

- "file-sequential": the directive affects the part of the file it appears in (and subfiles) following it in the parse sequence, ie below it, until the end of that file. "file-delimited" is a sub-case, where the directive may have a matching end directive to end its effect before the end of the file.

- "global", "chronological", "sequential": corresponding to the above, but not affected by file boundaries; they could affect all files, or all subsequent files.

You can find various examples of these in the ledger-likes. hledger transactions and postings are chronological. Our balance assertions are chronological plus sequential (for assertions on the same date). Account aliases and the comment directive are file-delimited. Our commodity directive is file-global (I *think*). Sequential directives are streaming-friendly.
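The difference between "file-sequential" and "file-global" scope, and why only the former streams, can be sketched in a few lines of Python. The item representation and both function names are made up for illustration, and "the last directive in the file wins" for the global case is an assumption of this sketch, not something hledger specifies.

```python
def apply_file_sequential(items):
    """items: ("directive", value) or ("entry", name) pairs in parse order.
    Each entry gets the value of the nearest directive above it, or None.
    One pass, so this works while streaming."""
    current = None
    result = []
    for kind, payload in items:
        if kind == "directive":
            current = payload
        else:
            result.append((payload, current))
    return result

def apply_file_global(items):
    """Each entry gets the file's directive regardless of position.
    Assumption: if there are several, the last one wins. This needs the
    whole file before any entry can be finalised."""
    values = [payload for kind, payload in items if kind == "directive"]
    setting = values[-1] if values else None
    return [(payload, setting) for kind, payload in items if kind == "entry"]

items = [("entry", "t1"), ("directive", ","), ("entry", "t2")]
print(apply_file_sequential(items))  # [('t1', None), ('t2', ',')]
print(apply_file_global(items))      # [('t1', ','), ('t2', ',')]
```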

A sequential commodity directive ?

I had some thoughts about making the commodity directive file-sequential, but I have temporarily forgotten the reasons for it. So I'll share the above and think on it again.


Simon Michael

2018-05-22 17:30:05 UTC

Post by Simon Michael
I had some thoughts about making the commodity directive file-sequential,

Well, never mind about that. More notes, steering back towards https://github.com/simonmichael/hledger/issues/698. These are untested thoughts, please excuse/correct any errors.

Local notations, notational scope

Within the journal format, we allow varying local notations for two things: dates and numbers.

We should expect that hledger can reliably parse multiple journal files with different notations, when provided as multiple -f arguments, or when they include each other; and regardless of how the files are ordered. Also we could assume, and should perhaps check where possible and useful, that consistent notation is used within a single file. (At least, that the decimal point character does not change within a file.)

This implies that information about the data format belongs with the data (obviously enough). Specifically, each individual file can have its own notation, eg decimal point character. Included files can have a different notation from their parent.

The decimal point character displayed in output is independent of, and might be different from, the above. It belongs with the commodity, or possibly with the report/command. Ie, regardless of the input data's format, in reports we want to see a consistent notation for each commodity. (Or possibly, for things like decimal point and digit groups, across all commodities.)

Parsing dates reliably

Our date parsing is fairly simple, because we restrict it to one of Y-M-D, Y/M/D, Y.M.D, M-D, M/D, M.D. For the last three we resolve the year using "the nearest applicable date source", which could be a parent transaction, or a currently active Y directive, or the current year at report time. No problems here right now, though https://github.com/simonmichael/hledger/issues/779 is slightly related.
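As an illustration of why this is simple, the six date forms fit in one regular expression. The Python sketch below is not hledger's parser: `parse_journal_date` is a hypothetical name, the year is assumed to have four digits, the two separators are not checked for consistency, and the caller is assumed to have already picked the "nearest applicable date source" and passes it as `default_year`.

```python
import re
from datetime import date

# Optional "YYYY<sep>" prefix, then month, separator, day.
DATE_RE = re.compile(r"^(?:(\d{4})[-/.])?(\d{1,2})[-/.](\d{1,2})$")

def parse_journal_date(text: str, default_year: int) -> date:
    """Parse Y-M-D, Y/M/D, Y.M.D, M-D, M/D or M.D.
    default_year is used only when the text has no year."""
    m = DATE_RE.match(text)
    if not m:
        raise ValueError(f"unrecognised date: {text!r}")
    year, month, day = m.groups()
    return date(int(year) if year else default_year, int(month), int(day))

print(parse_journal_date("2018/05/21", default_year=1999))  # 2018-05-21
print(parse_journal_date("5.21", default_year=2018))        # 2018-05-21
```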

Parsing numbers reliably

Number (and amount) parsing is complex, but we can pretty well auto-detect everything from the data. Except for one case: given an ambiguous number like 1,000 or 1.000, we can't tell from the local data if that's a decimal point or a digit group. We want to give a correct parse or a parse error, so we need more information from somewhere. Currently, we look for an applicable commodity directive, which will specify the decimal point character. This is working for people so far, but it has two limitations:

- the commodity directive as currently implemented doesn't have the scope we want (ie, local file only)
- it currently doesn't support a different input and output decimal point character

If there isn't a commodity directive, we guess that the number contains a decimal point (regardless of what the other numbers look like), and no error is raised.

Guessing is not good: when the guess is wrong, the user is unhappy, complains, and has to learn to add a commodity directive. Or worse, they might not notice.

We could do what a human would do: look at the other amounts in the file for clues. If at least one of them has both a decimal point and a digit group separator, or two or more digit group separators, we'll know which is which. But there's no guarantee; the whole file could be ambiguous. Also this is not streaming-friendly.
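The "look for clues" idea can be sketched as follows. This is a rough Python illustration with hypothetical function names, implementing only the two clues named above (both separator characters present, or one character repeated). It returns None when the whole list is ambiguous, and it needs all the numbers up front, which is the non-streaming cost just mentioned.

```python
import re

def decimal_mark_clue(number: str):
    """Return '.' or ',' if this number alone reveals the decimal mark,
    otherwise None."""
    marks = re.findall(r"[.,]", number)
    if len(set(marks)) == 2:
        # Both characters present: the rightmost one is the decimal mark.
        return marks[-1]
    if len(marks) >= 2:
        # One character repeated: it must be a digit group separator,
        # so the decimal mark is the other character.
        return "," if marks[0] == "." else "."
    return None  # zero or one separator: no clue

def infer_decimal_mark(numbers):
    """Combine the clues from all numbers; None if nothing is revealed."""
    clues = {c for c in map(decimal_mark_clue, numbers) if c is not None}
    if len(clues) > 1:
        raise ValueError("numbers use conflicting decimal marks")
    return clues.pop() if clues else None

print(infer_decimal_mark(["1,000", "2,500.75"]))  # .
print(infer_decimal_mark(["1,000", "1.000"]))     # None
```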

We could raise an error in this situation, requiring the user to add a commodity directive. That might not always be possible or convenient (eg read only data). It will break some files that used to work. We could do it by default nevertheless; or only in strict mode.

Simplest thing that could possibly work ?

Well, all this thinking is making me sweat. But it's better than being confused, or coding on into a dead end. Still, I remind myself that we shouldn't over-engineer; let's be open to simple cheap solutions for our current real-world needs. Which I think are...

- end confusion and issue reports to do with parsing and displaying numbers in various notations
- remove the possibility of unnoticed misparsed numbers

Simon Michael

2018-05-24 18:47:48 UTC

Still chewing on this. Remember: parsing a forest of files with multiple number notations. I'll propose some more ideas.

How to remove input decimal point ambiguity ?

Ignoring backward compatibility for a moment,

I don't think there's any good use case for mixing different decimal points within a single file. We should assume and check where possible that in each file a single consistent decimal point is used. (Wikipedia uses "decimal separator"; "decimal mark" is another term; I'll keep calling it "decimal point" for now.)

Then, commodity directives are not the right place to specify the file's decimal point. There could be a separate directive, eg "decimal ." (Though, this naming does not entirely make clear that it describes input, not output).

For maximum explicitness and simplicity, this directive would appear in each file. For easiest parsing, it would always appear at the top, before any amounts.
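Requiring the directive at the top keeps detection trivial, as this Python sketch suggests. The "decimal" directive is only the proposal from this message, not an existing hledger feature at the time of writing; `read_decimal_directive` is a hypothetical helper, and treating lines starting with ";" or "#" as comments is a simplification.

```python
def read_decimal_directive(lines):
    """Return the decimal mark declared by a leading 'decimal X' line,
    or None if the first significant line is anything else.
    Blank lines and ';' / '#' comment lines may precede the directive."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped[0] in ";#":
            continue
        parts = stripped.split()
        if len(parts) == 2 and parts[0] == "decimal" and parts[1] in (".", ","):
            return parts[1]
        return None  # something else came first: no usable directive
    return None

journal = ["; my european accounts", "", "decimal ,", "2018/05/21 groceries"]
print(read_decimal_directive(journal))  # ,
```

Because the function stops at the first significant line, the decimal point is known before any amount is read, which suits a streaming parser.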

In certain situations, we could relax the requirement to have it in every file, without causing decimal point ambiguity, depending on how much complication we want. Eg:

a. if there are no ambiguous-decimal-point amounts (like $1,000 or $1.000) in the file

b. if a decimal-point-determining amount (like $1,000,000 or $1,000.00) appears in the file before the first ambiguous-decimal-point amount. (Requires writing digit group separators in at least one amount, probably the first. If you reorder transactions, the file could become ambiguous again.)

c. if a decimal-point-determining amount appears anywhere in the file. (Requires writing digit group separators in at least one amount. Requires delaying the interpretation of numbers until the whole file has been parsed.)

d. if a default decimal point is configured by a --decimal command line option.

e. if a default decimal point is configured by a user config file. (Means the same journal file and command can produce different results across machines.)

f. if the default decimal point is taken from the current system locale. (Means the same journal file and command can produce different results across machines. Requires that a locale is configured.)

g. if we use a hard-coded default decimal point.

I think:

- a, b and c (relying on amount formats) are kind of a pain for users to learn and remember and maintain. Though, if you have a file that is unambiguous, you might resent being forced to add a seemingly-redundant directive to it.

- d (command line option) would be a chore, though perhaps useful when dealing with read-only files.

- e (config file) and f (locale) allow for variance between machines and users, which adds complexity. f would probably be quite convenient and intuitive in most cases.

- g (hard-coded default) is what I'm trying to avoid, as the world seems pretty evenly split on this question (see below). Although if we can find a suitable ISO standard, we could follow that, as we do for dates.
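The practical difference between situations b and c is only when the determining amount has to appear. A small Python sketch with hypothetical names shows this. It classifies numbers using just the examples given above: two or more separators is "determining", and a single separator followed by exactly three digits is "ambiguous". Anything else is ignored.

```python
import re

def classify(number: str) -> str:
    marks = re.findall(r"[.,]", number)
    if len(marks) >= 2:
        return "determining"  # e.g. 1,000,000 or 1,000.00
    if len(marks) == 1 and re.search(r"[.,]\d{3}$", number):
        return "ambiguous"    # e.g. 1,000 or 1.000
    return "other"            # not considered by this sketch

def ok_streaming(numbers) -> bool:
    """Situation b: a determining amount precedes the first ambiguous one.
    Decided in a single pass, stopping at the first relevant amount."""
    for n in numbers:
        kind = classify(n)
        if kind == "determining":
            return True
        if kind == "ambiguous":
            return False
    return True  # no ambiguous amounts at all (situation a)

def ok_whole_file(numbers) -> bool:
    """Situation c: a determining amount anywhere in the file is enough."""
    kinds = [classify(n) for n in numbers]
    return "determining" in kinds or "ambiguous" not in kinds

amounts = ["1,000", "2,500.00"]
print(ok_streaming(amounts), ok_whole_file(amounts))  # False True
print(ok_streaming(list(reversed(amounts))))          # True
```

The second call illustrates the caveat noted under b: reordering transactions can flip a file between resolvable and ambiguous.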

How to deal with unresolvable decimal point ambiguity

We could report an error when any decimal point ambiguity remains. This would break some journal files that used to work fine, requiring the user to tweak their files or config after upgrading, or perhaps when reading Ledger files.

We could make it "work" by making a best guess, like current hledger; but also print a warning when there is ambiguity. I have preferred to avoid warnings so far, but this could be a good place for one. Also, there could be a strict mode in which the warning would be an error.
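The warn-by-default, error-in-strict-mode behaviour could look roughly like this in Python. It is a sketch only: `parse_ambiguous`, `AmbiguousNumberError` and the `strict` flag are hypothetical names, and the input is assumed to contain exactly one period or comma.

```python
import warnings
from decimal import Decimal

class AmbiguousNumberError(ValueError):
    """Raised in strict mode instead of guessing."""

def parse_ambiguous(text: str, strict: bool = False) -> Decimal:
    """Parse a number with a single '.' or ','.
    Best guess: the separator is a decimal point."""
    message = f"ambiguous number {text!r}: assuming the separator is a decimal point"
    if strict:
        raise AmbiguousNumberError(message)
    warnings.warn(message)
    return Decimal(text.replace(",", "."))

print(parse_ambiguous("1,000"))  # 1.000, and a UserWarning is emitted
```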

What decimal separator and digit group separator are preferred where ?

https://en.wikipedia.org/wiki/International_System_of_Units has a useful short summary:

The symbol for the decimal marker is either a point or comma on the line. In practice, the decimal point is used in most English-speaking countries and most of Asia, and the comma in most of Latin America and in continental European countries.
Spaces should be used as a thousands separator (1 000 000) in contrast to commas or periods (1,000,000 or 1.000.000) to reduce confusion resulting from the variation between these forms in different countries.

So here's a rough mnemonic:

"decimal period is anglo-asian, decimal comma is euro-latin, space separator is preferred to avoid the period/comma issue".

Simon Michael

2018-05-27 15:07:37 UTC

Getting a bit more concrete now. Let me know if you agree the problems are real and worth fixing, and if you see problems with the plan. I know there will be corner-cases I haven't imagined. The plan is already quite complex to describe, and I'm a bit concerned this will leak out to users.

problems with amount parsing/rendering (in current master)

- numbers are parsed loosely, accepting either of two decimal separators
- we don't warn about inconsistent choice of decimal separator across amounts
- ambiguous-separator amounts can be silently misparsed (a single digit group separator is interpreted as a decimal separator)
- commodity directives, the recommended solution, have unclear semantics
  - they are used for: declaring commodity symbols, resolving input decimal separator ambiguity, controlling output style
  - the directive's scope for each of these is non-obvious. Should subfiles be affected ? other files ? transactions in the same file preceding the directive ?
- commodity directives used to resolve the input decimal separator also limit output style
  - they force digit group separators to appear in output
  - they force the output decimal separator to match the input
- D directives do all that commodity directives do and more, adding complexity
- allowing commodities to have different input decimal separators within a file is excessive flexibility
- allowing commodities to have different output decimal separators within a report is excessive flexibility
- lack of clarity makes this annoying to learn, costly to support; ongoing issue reports

goals

- be convenient and intuitive
- be i18n-aware
- detect and report errors, avoid guessing
- don't require learning detailed rules
- in every situation, do something that's sensible at least in hindsight
- keep as much backward and sideways compatibility as possible
- end confusion and bug reports about basic number parsing and rendering

a short term plan

- clarify commodity directives' scope, use them for describing both input and output
  - a directive for a commodity sets its input decimal point for the rest of the current file, exclusive of subfiles
  - a directive for a commodity declares its symbol's validity for the rest of the current file, exclusive of subfiles
  - the first directive for a commodity across all files sets its output style in the report
- refine decimal separator parsing
  - commodities and input decimal separators are always declared together, and per file
  - at the start of each file (and each included file), no commodities or input decimal separators are known
  - parsing a commodity directive
    - declares the commodity and its input decimal separator for the current file
    - the directive's amount must have a separator
    - if the separator is ambiguous, it is assumed to be the decimal separator. Could display a warning.
    - if this is the first directive for this commodity among all files: also declares the commodity's output style for the report
  - parsing a definite-separator amount whose commodity is not yet declared, has the same effect as a commodity directive. Could display a warning. In a future strict mode, this will raise an error.
  - parsing an ambiguous-separator amount whose commodity is not yet declared, has the same effect as a commodity directive and also displays a warning (one per file)
  - parsing an amount whose decimal separator is inconsistent with the one declared for its commodity, raises an error
- clarify docs
  - use commodity directives in each file to help parse it correctly, declaring commodity symbols and the decimal separator used
  - use commodity directives in the first/uppermost file to help control each commodity's output style, declaring the symbol position, digit groups, decimal separator, and number of decimal places
- simplify D directive if it gives any trouble
  - D only specifies a default symbol, eg: D $ . The old syntax is accepted for backwards compatibility but only the symbol matters.

a medium term plan

- perhaps add/change directives/options to make things clearer
- add strict mode
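The per-file parsing rules in the short term plan amount to a small state machine, sketched here in Python. The class and method names are hypothetical and this is not how hledger implements it. One assumption goes beyond the plan: a single separator counts as "ambiguous" only when exactly three digits follow it. Amounts with no separator are simply skipped.

```python
import re

class FileNumberState:
    """Separator knowledge for one file; create a fresh one per file,
    since nothing is known at the start of each file."""

    def __init__(self):
        self.decimal_marks = {}  # commodity symbol -> "." or ","
        self.warnings = []

    @staticmethod
    def _inspect(number):
        """Return (decimal_mark, definite); mark is None if no separator."""
        marks = re.findall(r"[.,]", number)
        if not marks:
            return None, True
        if len(set(marks)) == 2:
            return marks[-1], True  # 1,000.00: the rightmost mark is decimal
        if len(marks) >= 2:
            return ("," if marks[0] == "." else "."), True  # 1,000,000
        # Single separator: assumed to be the decimal separator; it is
        # ambiguous only if exactly three digits follow (sketch assumption).
        return marks[0], re.search(r"[.,]\d{3}$", number) is None

    def declare(self, commodity, sample):
        """Handle a commodity directive whose amount is `sample`."""
        mark, _ = self._inspect(sample)
        if mark is None:
            raise ValueError("the directive's amount must have a separator")
        self.decimal_marks[commodity] = mark

    def see_amount(self, commodity, number):
        """Handle an amount found in a posting."""
        mark, definite = self._inspect(number)
        if mark is None:
            return  # no separator: nothing to learn or check
        declared = self.decimal_marks.get(commodity)
        if declared is None:
            self.decimal_marks[commodity] = mark  # acts like a directive
            if not definite and not self.warnings:  # one warning per file
                self.warnings.append(
                    f"ambiguous amount {commodity}{number}: "
                    f"assuming {mark!r} is the decimal separator")
        elif definite and mark != declared:
            raise ValueError(
                f"{commodity}{number} conflicts with the declared "
                f"decimal separator {declared!r}")

state = FileNumberState()
state.declare("$", "1,000.00")    # "$" uses "." in this file
state.see_amount("$", "2,500")    # fine: the comma is a group separator
state.see_amount("EUR", "1,000")  # undeclared and ambiguous: warns
print(state.decimal_marks)        # {'$': '.', 'EUR': ','}
```

Note that once a commodity's separator is declared, an ambiguous amount can always be read consistently, so only definite-separator amounts can trigger the inconsistency error.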

Simon Michael

2018-05-27 15:23:10 UTC

Post by Simon Michael
a medium term plan
- perhaps add/change directives/options to make things clearer

More detail on this:

- add a decimal or decimal-separator directive to set that once per file. "decimal ,"

- add an alternate simpler form of commodity directive that just defines a symbol. "commodity $". And perhaps allow declaring multiple symbols with one directive.

- use system locale to choose a default output decimal separator

- add an --amount-style command line option that overrides output style, for individual commodities or all commodities.

Simon Michael

2018-06-01 23:56:48 UTC

Work will continue on this at https://github.com/simonmichael/hledger/issues/793 .