Simon Michael
2018-05-21 21:06:20 UTC
At https://github.com/simonmichael/hledger/issues/698, https://github.com/simonmichael/hledger/issues/688 and similar we are discussing issues around number parsing in journal format. Things which seem simple on the surface are proving quite slippery and intertwined. Some decisions are needed, so here are some thoughts.
Some goals:
Simplicity: minimise complexity in implementation, usage, learning.
I18n (internationalisation): support all major local number notations equally well. Don't be US/anglo-centric.
Correctness: minimise the chance of overlooked wrong results due to variations in number format. Assist with error detection as much as possible, at least in strict mode.
Predictability: be intuitive and avoid surprising users. Produce the same results from the same data every time, independent of file ordering, current locale, etc.
Flexibility: allow full control of how numbers are parsed (ie: choice of decimal point character), eg per commodity and per file.
Convenience: minimise boilerplate and configuration effort for common cases. Do the most right thing by default.
On i18n and flexibility:
When displaying output, we want to be able to control the decimal point character (at least period and comma) and digit grouping (none, comma, period, space, arbitrary group sizes..), for i18n reasons. People use different notations in different parts of the world. (https://en.wikipedia.org/wiki/Decimal_separator)
Currently this is possible on a per-commodity basis. Eg you can display one commodity with the symbol on the left and no decimal places; and another with the symbol on the right and 4 decimal places. This is useful.
You can even display one commodity with a comma decimal point and no digit groups, and another with a period decimal point and comma-separated lakhs/crores/thousands groups. This might be useful, neutral or anti-useful, I'm not sure. It allows mixed decimal point/digit grouping conventions in the output of a single report.
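To make the per-commodity display options concrete, here is a minimal sketch in Python (not hledger's actual Haskell code). The function name `format_amount` and its parameters are hypothetical, chosen only for illustration; group sizes are listed from the right, and the last size repeats.

```python
from decimal import Decimal

def format_amount(value, decimal_mark=".", group_mark="", group_sizes=(3,), places=2):
    """Render a Decimal with a chosen decimal mark and digit grouping.

    Hypothetical helper: group_sizes gives group widths starting from the
    right of the integer part; the last width repeats for remaining digits.
    """
    sign = "-" if value < 0 else ""
    # Fixed-point text such as "1234567.50", split at the period.
    int_part, _, frac_part = f"{abs(value):.{places}f}".partition(".")
    groups = []
    sizes = list(group_sizes)
    # Peel groups off the right-hand side of the integer part.
    while group_mark and len(int_part) > sizes[0]:
        groups.insert(0, int_part[-sizes[0]:])
        int_part = int_part[:-sizes[0]]
        if len(sizes) > 1:
            sizes.pop(0)
    groups.insert(0, int_part)
    text = group_mark.join(groups)
    if places > 0:
        text += decimal_mark + frac_part
    return sign + text

amount = Decimal("1234567.5")
comma_decimal = format_amount(amount, decimal_mark=",")
indian_style = format_amount(amount, decimal_mark=".", group_mark=",", group_sizes=(3, 2))
print(comma_decimal)  # 1234567,50
print(indian_style)   # 12,34,567.50
```

Both strings could appear in the same report if the two commodities are configured differently, which is the mixed-convention situation described above.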
When parsing, we need to know the identity of the decimal point character used in the data (period or comma), in order to correctly parse ambiguous numbers like 1,000 or 1.000. This is independent of the decimal point character in output. I think this should be allowed to vary from file to file (maybe you are aggregating data from multiple countries.)
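The ambiguity can be shown with a small Python sketch (again, not hledger's implementation). The function `parse_number` is hypothetical; it assumes the decimal mark is known in advance and treats the other mark, and spaces, as digit group separators.

```python
from decimal import Decimal

def parse_number(text, decimal_mark):
    """Parse a number string, given the decimal mark used in the data.

    Hypothetical helper: any period, comma or space that is not the
    decimal mark is assumed to be a digit group separator and is dropped.
    """
    if decimal_mark not in (".", ","):
        raise ValueError("decimal mark must be '.' or ','")
    group_marks = {".", ",", " "} - {decimal_mark}
    cleaned = "".join(ch for ch in text.strip() if ch not in group_marks)
    if cleaned.count(decimal_mark) > 1:
        raise ValueError(f"more than one decimal mark in {text!r}")
    return Decimal(cleaned.replace(decimal_mark, "."))

print(parse_number("1,000", "."))  # 1000
print(parse_number("1,000", ","))  # 1.000
print(parse_number("1.000", ","))  # 1000
```

The same text yields one thousand or one depending only on the assumed decimal mark, so a wrong assumption silently changes results by a factor of a thousand.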
We have been trying to express all of the above (plus the existence of commodities) with commodity (and D) directives, but in their present form I think they are not quite expressive enough.
Streaming ?
A somewhat related topic is, do we care about parsing data in a streaming fashion, ie doing as much work as possible as we go, rather than accumulating it till the end ? I am inclined to think we should prefer streaming-friendly designs and implementations when we have a choice. This is more possible when directives and journal entries are streaming-friendly.
Expanding on the discussion at http://hledger.org/manual.html#directive-scope-multiple-files, here are some different kinds of directive or configuration we could name:
- "file-global": the directive affects the whole file it appears in, regardless of its position in the file. Also any included subfiles. (I'll consider "file" to mean "file and its subfiles".)
- "file-chronological": the directive has a date and affects the journal entries in this file (and subfiles) that are dated after it, regardless of its position.
- "file-sequential": the directive affects the part of the file it appears in (and subfiles) following it in the parse sequence, ie below it, until the end of that file. "file-delimited" is a sub-case, where the directive may have a matching end directive to end its effect before the end of the file.
- "global", "chronological", "sequential": corresponding to the above, but not affected by file boundaries; they could affect all files, or all subsequent files.
You can find various examples of these in the ledger-likes. hledger transactions and postings are chronological. Our balance assertions are chronological plus sequential (for assertions on the same date). Account aliases and the comment directive are file-delimited. Our commodity directive is file-global (I *think*). Sequential directives are streaming-friendly.
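The difference between file-sequential and file-global scope, and why the former is streaming-friendly, can be sketched as follows. This is a toy Python model, not hledger code: the `decimal-mark` item, the tuple representation of a file, and the "last directive wins" rule for file-global scope are all assumptions made for illustration.

```python
from decimal import Decimal

def parse_number(text, decimal_mark):
    """Hypothetical parser: marks other than decimal_mark are group separators."""
    group_marks = {".", ",", " "} - {decimal_mark}
    cleaned = "".join(ch for ch in text if ch not in group_marks)
    return Decimal(cleaned.replace(decimal_mark, "."))

def parse_sequential(items, default_mark="."):
    """File-sequential: a directive affects only the items below it.

    Works in a single pass, so amounts can be emitted as they are read.
    """
    mark = default_mark
    for kind, value in items:
        if kind == "decimal-mark":
            mark = value
        else:
            yield parse_number(value, mark)

def parse_file_global(items, default_mark="."):
    """File-global: a directive affects the whole file, wherever it appears.

    The whole file must be read before any amount can be parsed.
    Assumption: if several directives appear, the last one wins.
    """
    items = list(items)
    marks = [value for kind, value in items if kind == "decimal-mark"]
    mark = marks[-1] if marks else default_mark
    return [parse_number(value, mark) for kind, value in items if kind != "decimal-mark"]

# A toy "file": an amount, then a directive, then the same amount text again.
journal = [("amount", "1,000"), ("decimal-mark", ","), ("amount", "1,000")]
sequential = list(parse_sequential(journal))
file_global = parse_file_global(journal)
print(sequential)   # [Decimal('1000'), Decimal('1.000')]
print(file_global)  # [Decimal('1.000'), Decimal('1.000')]
```

Under the sequential reading the first amount is parsed before the directive is seen, so the two readings disagree about it. That is the trade-off between streaming-friendliness and position-independence.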
A sequential commodity directive ?
I had some thoughts about making the commodity directive file-sequential, but I have temporarily forgotten the reasons for it. So I'll share the above and think on it again.
--
You received this message because you are subscribed to the Google Groups "hledger" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hledger+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.