Discussion:
[Opendnssec-user] kaspConnect() causes ods-enforcer brittleness
Havard Eidnes
2015-12-08 08:22:21 UTC
Permalink
Hi,

the ods-enforcer daemon loops, periodically looking for things
which need to be done / scheduled. Inside its service loop it
has this piece of code:

log_msg(config, LOG_INFO, "Connecting to Database...");
kaspConnect(config, &dbhandle);

followed by misleadingly-indented code (that's a minor issue).

However... What happens if an administrator just happens to run
"ods-ksmutil key list ..." at exactly that time when the enforcer
wants to connect to the database? This is what the log can tell:

Dec 8 08:32:43 hugin ods-enforcerd: Connecting to Database...
Dec 8 08:32:43 hugin ods-enforcerd: ERROR: error executing SQL - database is locked

and this time it looks like ods-enforcerd really means that this
is an error, because kaspConnect() doesn't return an error but
does an exit() if connecting to the database fails:

/*
* Connect to the DB
*/
void
kaspConnect(DAEMONCONFIG* config, DB_HANDLE *handle)
{
/* Note that all these XML derived strings are unsigned chars */
if (DbConnect(handle, (char *)config->schema, (char *)config->host, (char *)config->password, (char *)config->user, (char *)config->port) != 0) {
unlink(config->pidfile);
exit(-1);
}

}

That's not a good way to design a library function interface!

This makes ods-enforcer brittle, provides no automatic error
recovery, and requires operator intervention to restart the
now-dead enforcer.

Again, this is with OpenDNSSEC 1.4.7, using sqlite3.

Regards,

- Håvard
Petr Spacek
2015-12-09 10:30:27 UTC
Permalink
Post by Havard Eidnes
Hi,
the ods-enforcer daemon loops, periodically looking for things
which need to be done / scheduled. Inside its service loop it
log_msg(config, LOG_INFO, "Connecting to Database...");
kaspConnect(config, &dbhandle);
followed by misleadingly-indented code (that's a minor issue).
However... What happens if an administrator just happens to run
"ods-ksmutil key list ..." at exactly that time when the enforcer
Dec 8 08:32:43 hugin ods-enforcerd: Connecting to Database...
Dec 8 08:32:43 hugin ods-enforcerd: ERROR: error executing SQL - database is locked
and this time it looks like ods-enforcerd really means that this
is an error, because kaspConnect() doesn't return an error but
/*
* Connect to the DB
*/
void
kaspConnect(DAEMONCONFIG* config, DB_HANDLE *handle)
{
/* Note that all these XML derived strings are unsigned chars */
if (DbConnect(handle, (char *)config->schema, (char *)config->host, (char *)config->password, (char *)config->user, (char *)config->port) != 0) {
unlink(config->pidfile);
exit(-1);
}
}
That's not a good way to design a library function interface!
This makes ods-enforcer brittle, provides no automatic error
recovery, and requires operator intervention to restart the
now-dead enforcer.
Again, this is with OpenDNSSEC 1.4.7, using sqlite3.
I confirm that I have seen the very same issue.

BTW what is the purpose of home-grown lock around SQL database? I did not dig
into it but I would somehow expect that SQL database can deal with access from
multiple processes...
--
Petr Spacek @ Red Hat
Siôn Lloyd
2015-12-09 11:58:03 UTC
Permalink
Post by Petr Spacek
BTW what is the purpose of home-grown lock around SQL database? I did not dig
into it but I would somehow expect that SQL database can deal with access from
multiple processes...
It may no longer be needed; however when we started the project we could
not guarantee that SQLite would be built with concurrent user support.
On at least one of our target platforms (probably solaris) the OS
supplied default was not.

Sion
Berry A.W. van Halderen
2015-12-09 11:59:48 UTC
Permalink
Post by Petr Spacek
Post by Havard Eidnes
This makes ods-enforcer brittle, provides no automatic error
recovery, and requires operator intervention to restart the
now-dead enforcer.
Again, this is with OpenDNSSEC 1.4.7, using sqlite3.
I confirm that I have seen the very same issue.
BTW what is the purpose of home-grown lock around SQL database? I did not dig
into it but I would somehow expect that SQL database can deal with access from
multiple processes...
The database layer is being revamped in 2.0 (which is coming up soon
enough now) and will probably will undergo more changes. Hard to dig
through all the historical decisions, but I know there was a wish not
to be dependent on the database offering transactions or concurrent
access.
Also with the DOA layer in place the locks could be necessary.

This is something that could very well be reconsidered (or should IMO).

With kind regards,
Berry van Halderen
Petr Spacek
2015-12-14 13:32:02 UTC
Permalink
Post by Berry A.W. van Halderen
Post by Petr Spacek
Post by Havard Eidnes
This makes ods-enforcer brittle, provides no automatic error
recovery, and requires operator intervention to restart the
now-dead enforcer.
Again, this is with OpenDNSSEC 1.4.7, using sqlite3.
I confirm that I have seen the very same issue.
BTW what is the purpose of home-grown lock around SQL database? I did not dig
into it but I would somehow expect that SQL database can deal with access from
multiple processes...
The database layer is being revamped in 2.0 (which is coming up soon
enough now) and will probably will undergo more changes. Hard to dig
through all the historical decisions, but I know there was a wish not
to be dependent on the database offering transactions or concurrent
access.
Also with the DOA layer in place the locks could be necessary.
This is something that could very well be reconsidered (or should IMO).
I would like to ask you for reconsideration, pretty please :-)

This is really causing operational problems in our environment.
--
Petr Spacek @ Red Hat
Havard Eidnes
2016-03-18 08:36:56 UTC
Permalink
Post by Havard Eidnes
the ods-enforcer daemon loops, periodically looking for things
which need to be done / scheduled. Inside its service loop it
log_msg(config, LOG_INFO, "Connecting to Database...");
kaspConnect(config, &dbhandle);
followed by misleadingly-indented code (that's a minor issue).
However... What happens if an administrator just happens to run
"ods-ksmutil key list ..." at exactly that time when the enforcer
Dec 8 08:32:43 hugin ods-enforcerd: Connecting to Database...
Dec 8 08:32:43 hugin ods-enforcerd: ERROR: error executing SQL - database is locked
and this time it looks like ods-enforcerd really means that this
is an error, because kaspConnect() doesn't return an error but
Another variant of this problem of database locking just happened
to me. On the hourly run, the enforcer logged:

Mar 18 07:30:03 hugin ods-enforcerd: HSM connection open.
Mar 18 07:30:03 hugin ods-enforcerd: Reading config "/usr/pkg/etc/opendnssec/conf.xml"
Mar 18 07:30:03 hugin ods-enforcerd: Reading config schema "/usr/pkg/share/opendnssec/conf.rng"
Mar 18 07:30:03 hugin ods-enforcerd: Communication Interval: 3600
Mar 18 07:30:03 hugin ods-enforcerd: No DS Submit command supplied
Mar 18 07:30:03 hugin ods-enforcerd: SQLite database set to: /var/db/opendnssec/kasp.db
Mar 18 07:30:03 hugin ods-enforcerd: Log User set to: local0
Mar 18 07:30:03 hugin ods-enforcerd: Switched log facility to: local0
Mar 18 07:30:03 hugin ods-enforcerd: Connecting to Database...
Mar 18 07:30:03 hugin ods-enforcerd: Policy default found.
Mar 18 07:30:03 hugin ods-enforcerd: Key sharing is Off.
Mar 18 07:30:03 hugin ods-enforcerd: 369 zone(s) found on policy "default"
Mar 18 07:30:04 hugin ods-enforcerd: Predict we need 369 KSKs
Mar 18 07:30:04 hugin ods-enforcerd: Have 369 KSK keys in queue
Mar 18 07:30:04 hugin ods-enforcerd: Need 0 new KSK keys
Mar 18 07:30:04 hugin ods-enforcerd: No new KSKs need to be created.
Mar 18 07:30:04 hugin ods-enforcerd: Predict we need 369 new ZSK keys
Mar 18 07:30:04 hugin ods-enforcerd: Have 369 ZSK keys in queue
Mar 18 07:30:04 hugin ods-enforcerd: Need 0 new ZSK keys
Mar 18 07:30:04 hugin ods-enforcerd: No new ZSKs need to be created.
Mar 18 07:30:04 hugin ods-enforcerd: Purging keys...
Mar 18 07:30:04 hugin ods-enforcerd: zonelist filename set to /usr/pkg/etc/opendnssec/zonelist.xml.
Mar 18 07:30:04 hugin ods-enforcerd: Zone 156.193.in-addr.arpa found.
Mar 18 07:30:04 hugin ods-enforcerd: Policy for 156.193.in-addr.arpa set to default.
Mar 18 07:30:04 hugin ods-enforcerd: Config will be output to /var/opendnssec/signconf/156.193.in-addr.arpa.xml.
Mar 18 07:30:04 hugin ods-enforcerd: No change to: /var/opendnssec/signconf/156.193.in-addr.arpa.xml
Mar 18 07:30:04 hugin ods-enforcerd: Zone 157.193.in-addr.arpa found.
Mar 18 07:30:04 hugin ods-enforcerd: Policy for 157.193.in-addr.arpa set to default.
Mar 18 07:30:04 hugin ods-enforcerd: Config will be output to /var/opendnssec/signconf/157.193.in-addr.arpa.xml.
Mar 18 07:30:04 hugin ods-enforcerd: ERROR: error executing SQL - database is locked
Mar 18 07:30:04 hugin ods-enforcerd: No change to: /var/opendnssec/signconf/157.193.in-addr.arpa.xml

and then subseqently apparently for every zone:

Mar 18 07:30:04 hugin ods-enforcerd: Zone 91.78.in-addr.arpa found.
Mar 18 07:30:04 hugin ods-enforcerd: Policy for 91.78.in-addr.arpa set to default.
Mar 18 07:30:04 hugin ods-enforcerd: Config will be output to /var/opendnssec/signconf/91.78.in-addr.arpa.xml.
Mar 18 07:30:04 hugin ods-enforcerd: ERROR: error executing SQL - cannot start a transaction within a transaction
Mar 18 07:30:04 hugin ods-enforcerd: ERROR: database operation failed - cannot start a transaction within a transaction
Mar 18 07:30:04 hugin ods-enforcerd: KsmRequestKeys returned: 71691
Mar 18 07:30:04 hugin ods-enforcerd: Signconf not written for 91.78.in-addr.arpa

and then on the next hourly run it appears to have stumbled in
its own feet:

Mar 18 08:32:11 hugin ods-enforcerd: HSM connection open.
Mar 18 08:32:11 hugin ods-enforcerd: Reading config "/usr/pkg/etc/opendnssec/con
f.xml"
Mar 18 08:32:11 hugin ods-enforcerd: Reading config schema "/usr/pkg/share/opend
nssec/conf.rng"
Mar 18 08:32:11 hugin ods-enforcerd: Communication Interval: 3600
Mar 18 08:32:11 hugin ods-enforcerd: No DS Submit command supplied
Mar 18 08:32:11 hugin ods-enforcerd: SQLite database set to: /var/db/opendnssec/
kasp.db
Mar 18 08:32:11 hugin ods-enforcerd: Log User set to: local0
Mar 18 08:32:11 hugin ods-enforcerd: Switched log facility to: local0
Mar 18 08:32:11 hugin ods-enforcerd: Connecting to Database...
Mar 18 08:32:11 hugin ods-enforcerd: ERROR: error executing SQL - database is lo
cked

which is an instance of the problem quoted above, and which will
cause the enforcer to exit(-1).

I suspect once again that what caused this problem was that one
of my scripts which runs periodically e.g. to list all the zones
handled by OpenDNSSEC via "ods-ksmutil zone list" ran at about
the same time, causing the initial problem, because the script
which ran from cron produced this output, submitted at 07:31:00:

'ods-ksmutil zone list' failed, retrying
'ods-ksmutil zone list' failed, retrying
ERROR: error executing SQL - database is locked
'ods-ksmutil zone list' failed, retrying
ERROR: error executing SQL - database is locked
'ods-ksmutil zone list' failed, retrying
ERROR: error executing SQL - database is locked
'ods-ksmutil zone list' failed, retrying
ERROR: error executing SQL - database is locked
'ods-ksmutil zone list' failed, giving up

I'm not sure what caused the first ods-ksmutil failure in this
script, there's apparently no error message attached in the above
output, although the "sh -x" output from the script lists all 6
identical error messages.

Just restarting the enforcer appears to at least temporarily have
restored my OpenDNSSEC installation to working order.

Regards,

- Håvard

Loading...