  Considerations for ErrorCollection
  
  Presently, HTML Purifier takes a code-execution centric approach to handling
  errors. Errors are organized and grouped according to which segment of the
  code triggers them, not necessarily the portion of the input document that
  triggered the error. This means that errors are pseudo-sorted by category,
  rather than location in the document.
  
  One easy way to "fix" this problem would be to re-sort according to line number.
  However, the "category" style information we derive from naively following
  program execution is still useful. After all, each of the strategies that
  can report errors still processes the document mostly linearly. Furthermore,
  not only do they process linearly, but the way they pass off operations to
  sub-systems mirrors that of the document. For example, AttrValidator will
  linearly proceed through elements, and on each element will use AttrDef to
  validate those contents. From there, the attribute might have more
  sub-components, which have execution passed off accordingly.
  
  In fact, each strategy handles a very specific class of "error."
  
  RemoveForeignElements   - element tokens
  MakeWellFormed          - element token ordering
  FixNesting              - element token ordering
  ValidateAttributes      - attributes of elements
  
  The crucial point is that while we care about the hierarchy governing these
  different errors, we *don't* care about any other information about what actually
  happens to the elements. This brings up another point: if HTML Purifier fixes
  something, this is not really a notice/warning/error; it's really a suggestion
  of a way to fix the aforementioned defects.
  
  In short, the refactoring to take this into account kinda sucks.
  
  Errors should not be recorded in the order in which they are reported. Instead,
  they should be bound to the line (and preferably element) in which they were
  found. This means we need some way to uniquely identify every element in the
  document, which doesn't presently exist. An easy way of adding this would be to
  track column numbers alongside line numbers. An important ramification of this
  is that we *must* use the DirectLex implementation.
  
      1. Implement column numbers for DirectLex [DONE!]
      2. Disable error collection when not using DirectLex [DONE!]
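
As a rough illustration of step 1, a column can be derived from a byte offset
by counting back to the previous newline. This is only a sketch:
offsetToLineCol() is a hypothetical helper, not DirectLex's actual code, and it
assumes 1-based lines and 0-based columns counted in bytes.

```php
<?php
// Hypothetical helper (not part of HTML Purifier): map a byte offset in the
// input to a 1-based line number and a 0-based column number.
function offsetToLineCol(string $html, int $offset): array
{
    $before = substr($html, 0, $offset);
    // Every newline before the offset starts a new line.
    $line = substr_count($before, "\n") + 1;
    // The column is the distance from the most recent newline, if any.
    $lastNewline = strrpos($before, "\n");
    $col = ($lastNewline === false) ? $offset : $offset - $lastNewline - 1;
    return [$line, $col];
}

$html = "<p>one</p>\n<div>\n  <b>two</b>\n</div>";
[$line, $col] = offsetToLineCol($html, strpos($html, '<b>'));
echo "line $line, col $col\n"; // prints "line 3, col 2"
```

A real lexer would more likely keep a running count while scanning rather than
recount from the start for every token; the arithmetic is the same.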
  
  Next, we need to re-orient all of the error declarations to treat CurrentToken
  as being of the utmost importance. Since this is passed via Context, it's not
  always clear whether it's available. ErrorCollector should complain HARD if it
  isn't. There are some situations where we don't have a token available. These
  include:
  
      * Lexing - this can actually have a row and column, but NOT correspond to
        a token
      * End of document errors - bump this to the end
  
  Actually, we *don't* have to complain if CurrentToken isn't available; we just
  set it as a document-wide error. And actually, nothing needs to be done here.
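
The fallback described above could look roughly like the following. This is a
minimal sketch, not HTML Purifier's actual ErrorCollector: the function name
recordError(), the array-shaped token, and the 'document' bucket key are all
assumptions made for illustration.

```php
<?php
// Hypothetical sketch: file an error under the current token's position, or
// under a document-wide bucket when no token is available.
function recordError(array &$errors, ?array $currentToken, string $message): void
{
    if ($currentToken === null) {
        // No CurrentToken in the context: treat it as a document-wide error.
        $errors['document'][] = $message;
        return;
    }
    $errors[$currentToken['line']][$currentToken['col']][] = $message;
}

$errors = [];
recordError($errors, ['line' => 3, 'col' => 2], 'Unrecognized element removed');
recordError($errors, null, 'Unexpected end of document');
print_r($errors);
```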
  
  Something interesting to consider is whether or not we care about the locations
  of attributes and CSS properties, i.e. the sub-objects that compose these things.
  In terms of consistency, at the very least attributes should have column/line
  numbers attached to them. However, this may be overkill, as attributes are
  uniquely identifiable. You could go even further with CSS properties, but they
  are also uniquely identifiable.
  
  The bottom line, however, is that this information must be available in the
  form of the CurrentAttribute and CurrentCssProperty (theoretical) context
  variables, and it must be used to organize the errors that the sub-processes
  may throw. There is also a hierarchy of sorts that might have made merging
  these into one context variable more sensible, were it not for HTML's
  reasonably rigid structure: a CSS property will never contain an HTML
  attribute. So we won't ever get recursive relations, and having multiple
  depths won't ever make sense. Leave this be.
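
To make the use of the context variables concrete, here is a small sketch of
how the level of an error might be chosen from whatever is set when the error
is reported. The function errorLevel() is hypothetical, and a plain array
stands in for the real Context object; the variable names follow the text
above.

```php
<?php
// Hypothetical sketch: pick the hierarchy level of an error from the context
// variables that are set at the time it is reported.
// $context is a plain array standing in for the real Context object.
function errorLevel(array $context): string
{
    if (isset($context['CurrentCssProperty'])) {
        return 'property';
    }
    if (isset($context['CurrentAttribute'])) {
        return 'attr';
    }
    // Neither is set, so the error belongs to the token itself.
    return 'token';
}

echo errorLevel(['CurrentToken' => 'a']), "\n";
echo errorLevel(['CurrentToken' => 'a', 'CurrentAttribute' => 'style']), "\n";
echo errorLevel([
    'CurrentToken' => 'a',
    'CurrentAttribute' => 'style',
    'CurrentCssProperty' => 'color',
]), "\n";
```

Because a CSS property never contains an attribute, checking the most deeply
nested variable first is enough; no recursion is needed.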
  
  We already have this information, and consequently, using start and end is
  *unnecessary*, so long as the context variables are set appropriately. We don't
  care if an error was thrown by an attribute transform or an attribute definition;
  to the end user these are the same (for a developer, they are different, but
  they're better off with a stack trace (which we should add support for) in such
  cases).
  
      3. Remove start()/end() code. Don't get rid of recursion, though [DONE]
      4. Set up ErrorCollector to use context information to build hierarchies.
         This may require a different internal format. Use objects if it gets
         complex. [DONE]
  
         ASIDE
              More on this topic: since we are now binding errors to lines
              and columns, a particular error can have three relationships to that
              specific location:
  
              1. The token at that location directly
                  RemoveForeignElements
                  AttrValidator (transforms)
                  MakeWellFormed
              2. A "component" of that token (i.e. attribute)
                  AttrValidator (removals)
              3. A modification to that node (i.e. contents from start to end
                 token) as a whole
                  FixNesting
  
              This needs to be marked accordingly. In the presentation, it might
              make sense to keep (3) separate and make (2) a sublist of (1). (1) can
              be a closing tag, in which case (3) makes no sense at all, OR it
              should be related with its opening tag (this may not necessarily
              be possible before MakeWellFormed is run).
  
              So the line and column together serve as our identifier:
  
              $errors[$line][$col] = ...
  
              Then, we need to identify case 1, 2 or 3. They are identified as
              such:
  
              1. Need some sort of semaphore in RemoveForeignElements, etc.
              2. If CurrentAttr/CurrentCssProperty is non-null
              3. Default (FixNesting, MakeWellFormed)
  
              One consideration about (1) is that it usually is actually a
              (3) modification, but we have no way of knowing about that because
              of various optimizations. However, they can probably be treated
              the same. The other difficulty is that (3) is never a line and
              column; rather, it is a range (i.e. a duple) and telling the user
              the very start of the range may confuse them. For example,
  
              <b>Foo<div>bar</div></b>
              ^     ^
  
              The node being operated on is <b>, so the error would be assigned
              to the first caret, with a "node reorganized" error. Then, the
              ChildDef would have submitted its own suggestions and errors with
              regard to what's going on in the internals. So I suppose this is
              ok. :-)
  
              Now, the structure of the earlier mentioned ... would be something
              like this:
  
              object {
                  type = (token|attr|property),
                  value, // appropriate for type
                  errors = array(),
                  sub-errors = [recursive],
              }
  
              This helps us keep things agnostic. It is also complex enough to
              warrant an object.
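
The struct from the aside might be sketched as a small class like the one
below. This is an illustration only: the class name ErrorNode, its property
names, and the getChild()/addError() methods are assumptions, not the format
HTML Purifier actually settled on.

```php
<?php
// Hypothetical sketch of the error struct described in the aside.
class ErrorNode
{
    public $type;           // 'token', 'attr', or 'property'
    public $value;          // the token, attribute name, or property name
    public $errors = [];    // messages attached directly to this node
    public $children = [];  // nested ErrorNodes, keyed by type, then name

    public function __construct(string $type, $value)
    {
        $this->type = $type;
        $this->value = $value;
    }

    public function addError(string $message): void
    {
        $this->errors[] = $message;
    }

    // Fetch (or lazily create) the child node for a sub-component.
    public function getChild(string $type, string $name): ErrorNode
    {
        if (!isset($this->children[$type][$name])) {
            $this->children[$type][$name] = new ErrorNode($type, $name);
        }
        return $this->children[$type][$name];
    }
}

// Errors are keyed by the line and column of the owning token.
$errors = [];
$errors[1][0] = new ErrorNode('token', '<a style="colr:red">');
$errors[1][0]->getChild('attr', 'style')
    ->getChild('property', 'colr')
    ->addError('Unrecognized CSS property removed');
```

Keying children by type and then by name relies on attributes and CSS
properties being uniquely identifiable within their parent, as noted earlier.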
  
  So, more deliberation about the object format is in order. The way HTML
  Purifier is currently set up, the only possible hierarchy is:
  
      token -> attr -> css property
  
  These relations do not exist all of the time; a comment or end token would not
  ever have any attributes, and non-style attributes would never have CSS properties
  associated with them.
  
  I believe that it is worth supporting multiple paths. At some point, we might
  have a hierarchy like:
  
      * -> syntax
        -> token -> attr -> css property
                         -> url
                 -> css stylesheet <style>
  
  et cetera. Now, one of the practical implications of this is that every "node"
  on our tree is well-defined, so in theory it should be possible to either 1.
  create a separate class for each error struct, or 2. embed this information
  directly into HTML Purifier's token stream.  Embedding the information in the
  token stream is not a terribly good idea, since tokens can be removed, etc.
  So that leaves us with 1... and if we use a generic interface we can cut down
  on a lot of code we might need. So let's leave it like this.
  
  ~~~~
  
  Then we set up suggestions.
  
      5. Set up a separate error class that tells the user about any
         modifications HTML Purifier made.
  
  Some information about this:
  
  Our current paradigm is to tell the user what HTML Purifier did to the HTML.
  This is the most natural mode of operation, since that's what HTML Purifier
  is all about; it was not meant to be a validator.
  
  However, most other people have experience dealing with a validator. In cases
  where HTML Purifier unambiguously does the right thing, simply giving the user
  the correct version isn't a bad idea, but problems arise when:
  
  - The user has such bad HTML that we do something odd, when we should have
    just flagged the HTML as an error. For example, we remove text found
    directly inside a <table> tag. It was probably meant to be in a <td> tag or
    outside the table, but we're not smart enough to realize this, so we just
    remove it. In such a case, we should tell the user that there was foreign
    data in the table, but we shouldn't "demand" that the user remove the data;
    it's more of a "here's a possible way of rectifying the problem" suggestion.
  
  - Giving line context for input is hard but feasible; giving line context for
    output will be extremely difficult due to shifting lines. We'd probably have
    to track what the tokens are and then find the appropriate output context,
    and even that is not guaranteed to work.
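
One way to picture the split between an error and a suggestion is to record
them as separate fields. The sketch below is hypothetical: describeFinding()
and its array keys are illustrative names, and the messages are made up for
the <table> example above.

```php
<?php
// Hypothetical sketch: keep what is wrong with the input separate from what
// HTML Purifier did (or would do) about it.
function describeFinding(string $problem, ?string $modification): array
{
    return [
        'error'      => $problem,       // what is wrong with the input
        'suggestion' => $modification,  // one possible fix, not a demand
    ];
}

$finding = describeFinding(
    'Text found directly inside <table>',
    'Remove the text, or move it into a <td> or outside the table'
);
echo $finding['error'], "\n";
echo 'Possible fix: ', $finding['suggestion'], "\n";
```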
  
  ~~~~
  
  Don't forget to spruce up output.
  
      6. Output needs to automatically give line and column numbers, basically
         "at line" on steroids. Look at W3C's output; it's ok. [PARTIALLY DONE]
  
         - We need a standard CSS to apply (check demo.css for some starting
           styling; some buttons would also be hip)
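
For item 6, a plain-text version of the report could be produced by walking
the errors in document order. This is a sketch under assumptions: it presumes
errors are stored as $errors[$line][$col] lists of messages, and formatErrors()
and the message wording are hypothetical.

```php
<?php
// Hypothetical sketch: flatten errors keyed as $errors[$line][$col] into
// report lines ordered by position in the document.
function formatErrors(array $errors): array
{
    $lines = [];
    ksort($errors);                   // order by line number
    foreach ($errors as $line => $cols) {
        ksort($cols);                 // then by column within the line
        foreach ($cols as $col => $messages) {
            foreach ($messages as $message) {
                $lines[] = "Line $line, Column $col: $message";
            }
        }
    }
    return $lines;
}

$errors = [
    7 => [4 => ['Attribute removed']],
    2 => [10 => ['Tag closed by document end'], 0 => ['Unrecognized tag removed']],
];
echo implode("\n", formatErrors($errors)), "\n";
```

Sorting at output time means the collector can record errors in whatever order
the strategies report them.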
  
      vim: et sw=4 sts=4