Diamond: develop a tool for finding entries in the schema that are unused
We currently have the diamond_validation test, which checks that every flml file is valid with respect to the current schema. I would also like a complementary tool: given a schema and a set of flml files, report which entries in the schema are never used.
The reason for this is that parsing the schema is the most expensive part of diamond's operation. As cruft accumulates in the schema, diamond only gets slower and slower. However, if we can automatically identify schema cruft, then it can be stripped and everyone will be happy.
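At its core, the proposed tool is a set difference: collect every element path in the schema, collect every path exercised by the flml files, and report the paths in the first set but not the second. A minimal sketch of that idea follows; the nested-dict tree representation and the `collect_paths` helper are illustrative assumptions, not diamond's real schema API.

```python
def collect_paths(tree, prefix=""):
    """Recursively collect every element path in a nested-dict tree."""
    paths = set()
    for name, child in tree.items():
        path = f"{prefix}/{name}"
        paths.add(path)
        if isinstance(child, dict):
            paths |= collect_paths(child, path)
    return paths

def unused_entries(schema_tree, flml_trees):
    """Schema paths that no flml file ever exercises."""
    schema_paths = collect_paths(schema_tree)
    used_paths = set()
    for flml in flml_trees:
        used_paths |= collect_paths(flml)
    return schema_paths - used_paths

# Toy example: one schema, one flml that only uses part of it.
schema = {"fluidity_options": {"geometry": {"dimension": None,
                                            "quadrature": None},
                               "legacy_option": None}}
flml = {"fluidity_options": {"geometry": {"dimension": None}}}
print(sorted(unused_entries(schema, [flml])))
# → ['/fluidity_options/geometry/quadrature', '/fluidity_options/legacy_option']
```

The expensive part in practice is not the set difference but building the full schema path set in the first place, which is what the whiteboard discussion below is about.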
Blueprint information
- Status: Started
- Approver: Patrick Farrell
- Priority: Undefined
- Drafter: Patrick Farrell
- Direction: Needs approval
- Assignee: Fraser Waters
- Definition: New
- Series goal: None
- Implementation: Beta Available
- Milestone target: None
- Started by: Fraser Waters
- Completed by:
Whiteboard
[fwaters]
So this is almost working, but I'm running up against a problem. It's slow, like really slow, and I don't think there's a lot we can do about it. The schemas are huge, and iterating over the entire thing 3 times (once to find the full set, once to populate the treeview, and once to colour it correctly) is going to take a while. Hell, iterating once takes a while.
[pefarrell]
Slow means -- an hour? a day? a week?
[fwaters]
About half an hour currently. That's only comparing against one flml file, but the actual flml files don't take as long to process, so doing more shouldn't significantly slow things down.
[pefarrell]
I don't know exactly what approach you're taking, but it sounds like there's something fundamentally wrong with it if it takes that long. Why would registering the used parts of the schema take any longer than reading the flml?
[fwaters]
Reading an flml only has to read the schema as far as it matches the flml. Reading the whole schema means reading in over 2000 elements (for flml). Just building up the set of paths from that takes about 3 minutes, and we have to iterate over it 3 times, which takes about 10 minutes in total. So I guess half an hour was an overestimate, but it's not fast.
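One way to avoid paying for three separate traversals is to do a single depth-first walk and build the full set, the treeview rows, and the colouring in the same pass. This is only a sketch of that idea: the nested-dict tree, the `used_paths` set, and the row/colour representations are hypothetical stand-ins for diamond's real data structures.

```python
def walk(tree, prefix=""):
    """Single depth-first pass yielding (path, depth) for every element."""
    for name, child in tree.items():
        path = f"{prefix}/{name}"
        yield path, path.count("/") - 1
        if isinstance(child, dict):
            yield from walk(child, path)

def build_view(schema_tree, used_paths):
    """One traversal builds all three products that previously took three."""
    full_set, rows, colours = set(), [], {}
    for path, depth in walk(schema_tree):
        full_set.add(path)                                  # pass 1: full set
        rows.append("  " * depth + path.rsplit("/", 1)[-1])  # pass 2: treeview
        colours[path] = "black" if path in used_paths else "red"  # pass 3: colour
    return full_set, rows, colours
```

Under this assumption the schema is read once instead of three times, which would cut the roughly 10 minutes of iteration described above to a third, though it does nothing about the cost of the initial parse itself.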