Channel: ATeam Chronicles

BPM Auditing Demystified

I've heard from a couple of customers recently asking about BPM audit table growth, specifically in BPM_AUDIT_QUERY. This led me to investigate the impact of the various audit levels in SOA/BPM on these tables and to propose options. It is important to note up-front that BPM is a human-centric workflow application and should therefore be expected to audit often and in detail... the reality is that business users will probably want to know who did what and when, and also who didn't do what when they were supposed to. BPM auditing is very rich and can provide this kind of information and more. The "downside" of this is that the audit tables can grow at a faster rate than expected, and BPM_AUDIT_QUERY is normally the most prominent of these. There are well-documented strategies for archiving/purging and partitioning which can control/limit the impact of table growth, but there may also be simple changes to the BPM audit settings which can prove beneficial in certain business situations.

Audit Settings

There are essentially three places where the auditing of BPM applications can be controlled....

SOAINFRA Level

BPMA_01 BPMA_02

BPMN Engine Level

BPMA_03 BPMA_04

SOA Composite

BPMA_05

Audit Comparison

Now that we know where to set the audit levels, let's see what impact various combinations have...
  • in terms of the number of rows written to BPM_AUDIT_QUERY & BPM_CUBE_AUDITINSTANCE
  • in terms of what we see in the Enterprise Manager flow trace so we can determine if what is visible is an acceptable compromise

The Project

In order to demonstrate what effect the audit levels have, we'll use a very simple BPM process with one human activity.... BPMA_07 For each test we'll run through a complete instance of this process: initiate it from the Enterprise Manager composite tester & approve the human task inside BPM Workspace. We'll then look at the number of rows added to the tables and also the instance view & flow trace in the Enterprise Manager.

Standard Development Level

For the first test we'll use the out-of-the-box settings shown in the screenshots above....
  • SOAINFRA Level - development
  • BPMN Engine Level - inherit
  • Composite Level - inherit
BPMA_08 ...so 8 rows added to BPM_AUDIT_QUERY: two for each activity (in/out) and two for the scope of the process itself (in/out). Looking at the flow trace.... BPMA_06 ...exactly as we would expect, we can see all activities, and the input & output data would be visible for all of them (as they are highlighted in blue and clickable).
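The row counts observed above follow a simple pattern. As a hypothetical sketch (the function name and the "two rows per audited activity plus two for the process scope" rule are inferred from the test results in this post, not from any documented formula):

```python
# Hypothetical sketch: estimate the rows added to BPM_AUDIT_QUERY for one
# completed instance at development/production audit level, based on the
# pattern observed above: two rows per audited activity (in/out) plus two
# for the process scope itself (in/out).
def estimated_audit_rows(activity_count: int) -> int:
    """Rows written to BPM_AUDIT_QUERY for one completed instance."""
    return 2 * activity_count + 2  # in/out per activity + in/out for the scope

# The test process audits three activities (start event, human task, end
# event), which matches the 8 rows observed in the first test.
print(estimated_audit_rows(3))  # 8
```

A rough rule of thumb like this is useful for capacity planning: multiply the per-instance estimate by the expected instance rate to project table growth.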

Default Production Level

A standard best practice is to "set the audit level to production"; we've all read this in numerous documents, blogs and white papers, so let's see what effect this has....
  • SOAINFRA Level - production
  • BPMN Engine Level - inherit
  • Composite Level - inherit
BPMA_09 BPMA_11 ...so, no different from development mode in terms of the number of rows added. What about the flow trace.... BPMA_12 ...we can see that visually it is the same, but not all of the data will be visible: only the data associated with the human activity is clickable. So, an interesting result... this change does not seem to have made any positive impact at all on the two tables we're interested in... we'll understand why later.

Composite Level OFF

Let's see what impact turning off auditing at the composite level has....
  • SOAINFRA Level - production
  • BPMN Engine Level - inherit
  • Composite Level - OFF
BPMA_13 BPMA_14 ...so, a reduction in the number of rows written to BPM_AUDIT_QUERY. But what about the flow trace.... BPMA_15 ...we can't even launch it; worse still, it doesn't even show in the list of instances.... BPMA_16 ...and worse again, this instance will have no composite instance id and is known as "orphaned". This combination is clearly a complete non-starter.

BPMN Engine Level OFF

Let's turn off auditing at the level of the BPMN engine, after turning it back on at the level of the composite....
  • SOAINFRA Level - production
  • BPMN Engine Level - OFF
  • Composite Level - inherit
BPMA_17 BPMA_18 BPMA_19 ...so a reduction in the number of rows written to BPM_AUDIT_QUERY. What about the flow trace this time.... BPMA_20 BPMA_21 ...we just see the human activity, no more, and it is not drillable, so we will not be able to see any instance data. A better result, but perhaps not ideal.

BPMN Monitors & Sensors Unchecked

  • SOAINFRA Level - production
  • BPMN Engine Level - OFF
  • Composite Level - inherit
  • BPMN Monitors & Sensors - OFF
Let's look at the "help" that appears when we click on the BPMN engine audit level to understand what the options mean.... BPMA_22 ...i.e. when set to "off", if measurement is enabled then it is overridden to "minimal". What does this mean? Well, if we're collecting analytic data for our process then the engine will always need to audit regardless, so let's disable the BPMN monitors & sensors... BPMA_23 BPMA_27 ...no rows added to either table, but at what cost to visibility? BPMA_25 ...well, we still see a composite flow trace, but what happens when we drill down to the BPM process? BPMA_26 ...there's nothing. This may be an acceptable compromise for some composites given the huge impact it has on the BPM audit tables. But.... we've turned auditing off at the level of the BPMN engine, and this will affect everything; how can we manage this at composite level?

Composite at Production

Let's try making one change, the composite level from "inherit" to "production"...
  • SOAINFRA Level - production
  • BPMN Engine Level - OFF
  • Composite Level - production
  • BPMN Monitors & Sensors - OFF
BPMA_28 BPMA_31 ...we have the audit rows written again.... BPMA_29 BPMA_30 ...and we have visibility in the Enterprise Manager. So, we have shown that by simply changing the composite audit level between "inherit" & "production" we can effectively turn the auditing on & off at composite level.

Summary

We have seen in this blog the effects of altering the audit levels available to BPM on the BPM audit tables, BPM_AUDIT_QUERY & BPM_CUBE_AUDITINSTANCE, and on the monitoring available within the Enterprise Manager. Bear in mind we have not looked at the impact on process tracking within BPM Workspace, which will also be affected. Knowing which levels are appropriate for which composite is a business decision: it's a simple trade-off between visibility and data storage. The more visibility we require, the more we have to insert, and these inserts have a dual cost: the storage required and the runtime cost of the SQL insert, both of which could be relevant. If your project is impacted by the amount and/or frequency of auditing, I suggest you investigate the audit levels in a similar manner to that described above to determine what is best for you & your business.

Manual Recovery Mechanisms in SOA Suite and AIA


Introduction

Integration flows can fail at run-time with a variety of errors. The cause of these failures could be either business errors or system errors. When synchronous integration flows fail, they are restarted from the beginning. Asynchronous integration flows, on the other hand, can potentially be resubmitted/recovered from designated, pre-configured milestones within the flow when they error. These milestones are persistence points, such as queues, topics or database tables, where the state of the flow was last persisted. Recovery is a mechanism whereby a faulted asynchronous flow can be rerun from such a persistence milestone. The SOA Suite 11g and AIA products provide various automated and manual recovery mechanisms to recover from asynchronous fault scenarios. They differ based on the SOA component that encounters the error; for instance, recovering from a BPEL fault may be quite different from recovering from a Resequencer fault. In this blog, we look at the various manual recovery mechanisms and options available to an end user. Manual recovery mechanisms require an admin user to take appropriate action on the faulted instance from the Oracle Enterprise Manager Fusion Middleware Control [EM FMWC Console]. The intention of this blog is to provide a quick reference for manual recovery of faults within the SOA and AIA contexts. It aims to present in one place valuable information regarding manual recovery that is currently spread across many sources, such as the SOA Developer's Guide, SOA Administrator's Guide, AIA FP Developer's Guide and AIA FP Infrastructure and Utilities Guide. Next we look at the various manual recovery mechanisms available in SOA Suite 11g and AIA, starting with BPEL Message Recovery.

BPEL Message Recovery

To understand BPEL Message Recovery, let us briefly look at how the BPEL Service Engine performs asynchronous processing. Asynchronous BPEL processes use an intermediate Delivery Store in the SOA Infrastructure database to store the incoming request. The message is then picked up and further BPEL processing happens in an invoke thread, one of the free threads from the 'Invoke Thread Pool' configured for the BPEL Service Engine. The processing of the message from the Delivery Store onwards, until the next dehydration point in the BPEL process or the next commit point in the flow, constitutes a transaction. The figure below shows at a high level the asynchronous request handling by a BPEL invoke thread.

Any unhandled errors during this processing will cause the message to roll back to the Delivery Store. The Delivery Store acts as a safe milestone for any errors that cause the asynchronous BPEL processing to roll back. In such scenarios, the messages sitting in the Delivery Store can be resubmitted for processing using BPEL Message Recovery. It is quite similar for callback messages that arrive for in-flight BPEL process instances: the callback messages are persisted in the Delivery Store, and a free thread from the Engine Thread Pool performs correlation and asynchronously processes the callback activities. Callback messages from faulted activities are available at the Delivery Store for recovery.

Refer to the section from the FMW Administrator's Guide here - http://docs.oracle.com/cd/E28280_01/admin.1111/e10226/bp_config.htm#CEGFJJIF for details on configuring the BPEL Service Engine thread pools.

Recovery of these invoke/callback messages can be performed from the Oracle Enterprise Manager Fusion Middleware Control [EM FMWC Console] [SOA->Service Engine->BPEL->Recovery]. The admin user can search for recoverable messages by filtering on the available criteria on this page. The figure below shows the BPEL Engine Recovery page, where the messages eligible for recovery are searched based on message type and state. During recovery of these messages, the end user cannot make any modifications to the original payload. The messages marked recoverable can either be recovered or aborted; in the former case, the original message is simply redelivered for processing again. The BPEL configuration property 'MaxRecoverAttempt' determines the number of times a message can be recovered, manually or automatically. Messages go to the exhausted state after reaching MaxRecoverAttempt; they can be selected and 'Reset' to make them available for manual/automatic recovery again. In addition to invoke and callback messages, the BPEL Recovery console can also be used to recover activities which have an expiration time associated with them, like the Wait activity. Expired activities can be searched for and recovered; these are then rescheduled for execution. The BPEL Service Engine can also be configured to automatically recover failed messages, either on server startup or during scheduled time periods. Refer to the section from the FMW Administrator's Guide here - http://docs.oracle.com/cd/E28280_01/admin.1111/e10226/bp_config.htm#CDEHIIFG for details on setting up auto recovery.
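The MaxRecoverAttempt bookkeeping described above can be sketched as a small state machine. This is a toy model for illustration only; the class and method names are hypothetical and not the engine's API:

```python
# Toy model of the MaxRecoverAttempt behavior: a message can be recovered
# (manually or automatically) only a limited number of times before it moves
# to the "exhausted" state; a Reset makes it recoverable again.
class RecoverableMessage:
    def __init__(self, max_recover_attempt: int = 2):
        self.max_recover_attempt = max_recover_attempt
        self.attempts = 0
        self.state = "recoverable"

    def recover(self) -> None:
        """One recovery attempt that fails again downstream."""
        if self.state != "recoverable":
            raise RuntimeError("message is not in a recoverable state")
        self.attempts += 1
        if self.attempts >= self.max_recover_attempt:
            self.state = "exhausted"

    def reset(self) -> None:
        # Corresponds to selecting exhausted messages and clicking 'Reset'.
        self.attempts = 0
        self.state = "recoverable"

msg = RecoverableMessage(max_recover_attempt=2)
msg.recover()        # first recovery attempt fails again
msg.recover()        # second attempt: limit reached
print(msg.state)     # exhausted
msg.reset()
print(msg.state)     # recoverable
```

The practical point is that exhausted messages are not lost; they simply stop being retried until an admin resets them.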

SOA Fault Recovery

The fault handling for invocations from SOA components can be enhanced, customized and externalized by using the Fault Management Framework (FMF). We will not go into the details of the Fault Management Framework here; refer to the A-Team blog post here - http://www.ateam-oracle.com/fault-management-framework-by-example for insights into the FMF.

In short, FMF allows a fault policy with configurable actions to be bound to a SOA component. Policies can be attached at the composite, component or reference level. The configured actions are executed when the invocation fails; the available actions include retry, abort, human intervention, custom Java callout, etc. When the action applied is human intervention, the faults become available for manual recovery from the Oracle Enterprise Manager Fusion Middleware Control [EM FMWC Console]. They show up as recoverable instances in the faults tab of 'SOA->faults and rejected messages', as shown in the figure below.


During the recovery, the admin user can opt for one of Retry, Replay, Abort, Rethrow or Continue as the recovery option. For the Retry option, the EM user has access to the payload; the payload can be changed and resubmitted during recovery. While this is a useful feature, it could pose an audit/security issue from an administrative perspective if it is not properly controlled using users/roles. The Retry action can also chain execution into a custom Java callout to do additional processing after a successful retry; this is selected from the 'After Successful Retry' option during Retry. The custom Java callout should be configured in the fault policy file attached to the composite.

Resequencer Recovery

Mediator Resequencer groups which end up in the Errored or Timed Out states can be recovered from the EM Console by an admin user. In fact, Resequencer faults have no automated recovery mechanism and rely only on manual recovery by the admin for their remediation. Mediator Resequencer faults can be searched and filtered from the faults page of the Mediator component; the figure below shows a search of faults by resequencing group. An Errored group can be recovered by choosing either Retry or Abort. Retry reprocesses the current failed message belonging to the faulted group. In the case of Abort, the current failed message is marked as failed and processing resumes from the next available in-sequence message for the faulted group. In both cases the group itself is unlocked and set to ready so that it can process further messages. As can be seen in the figure below, the admin user can modify the request payload during this recovery. In the case of the standard Resequencer, groups can end up Timed Out when the next in-sequence message does not arrive within the timeout period. Such groups can be recovered by skipping the missing message; the figure below shows such a recovery. In this case the processing of the group continues from the next message rather than waiting for the missing sequence id.

AIA Message Resubmission

This section deals with integrations built using the AIA Foundation Pack. Refer to the AIA Concepts and Technologies Guide at http://docs.oracle.com/cd/E28280_01/doc.1111/e17363/toc.htm to familiarize yourself with the AIA concepts. Let us look at a common design pattern employed in AIA integrations. The figure below, from the AIA Foundation Pack Developer's Guide, shows an architecture used for guaranteed message delivery between source and target applications with no intermediate persistence points. The blocks shown are SOA Suite composites. The source and target milestones are persistence points such as queues, topics or database tables. The same design can also be enhanced with multiple intermediate milestones for more complex flows. Such flows are commonly seen in AIA Pre-Built Integrations which use asynchronous flows to integrate systems, e.g. the Communications O2C Integration Pack for Siebel, BRM and OSM. Refer to the Pre-Built Integrations documentation here - http://www.oracle.com/technetwork/apps-tech/aia/documentation/index.html. The salient points of this design are:
  • A Single transaction performs the message consumption from source milestone, the processing and the message delivery to target milestone
  • All blocks are implemented using SOA Suite composites
  • Any errors/faults in processing, rollback the message all the way to the source milestone
  • Milestone Queues and Topics are configured with Error Destinations to hold the rolled back messages for resubmission.
  • An enhanced Fault message (AIAFault) is raised and stored in the AIA Error Topic. This Fault has sufficient information to resubmit the message from the Source milestone.
The faults can be recovered using the AIA Error Resubmission Utility. In AIA Foundation Pack 11.1.1.7.0 the AIA Error Resubmission Utility is a GUI utility and can be used for single or bulk fault recoveries. It can be accessed from AIA Home -> Resubmission Utility in the AIA application, as shown in the figure below. Earlier versions of AIA Foundation Pack 11g have only a command-line utility for error resubmission, available at <AIA_HOME>/util/AIAMessageResubmissionUtil.

Any fault within the flow will roll back to the previous milestone or recovery point and enable resubmission from that point. The milestones can be queues, topics or AQ destinations. The queues and topics designed to be milestones are associated with corresponding error destinations; this is where the faulted messages reside. In the case of a queue or topic, the AIA Resubmission Utility simply redelivers the messages from the error destination back to the milestone destination for reprocessing. In the case of Resequencer errors, the Resequencer is the recovery point and holds the message for resubmission. Note that the Resequencer is not typically designed as a milestone in the flow but acts as a recovery point for Resequencer errors; for such errors, the AIA Resubmission Utility recovers the failed Resequencer message and also unlocks the faulted Resequencer group for further processing. It is important to note that the AIA error handling and resubmission mechanism is a designed solution: it relies on the integration implementing the principles and guidelines of the AIA Foundation Pack and the AIA Guaranteed Message Delivery pattern for its accurate functioning. Refer to the AIA Foundation Pack Infrastructure and Utilities Guide at http://docs.oracle.com/cd/E28280_01/doc.1111/e17366/toc.htm for details of the AIA Error Handling Framework and the AIA Resubmission Utility.
Refer to the AIA Foundation Pack Developers Guide at http://docs.oracle.com/cd/E28280_01/doc.1111/e17364/toc.htm for implementing AIA Error Handling and Recovery for the Guaranteed Message Delivery Pattern.

Use Case: Message Resubmission with Topic Consumers

Let us next look at a use case from one of our customer engagements: a custom integration developed using the AIA Guaranteed Message Delivery pattern and employing the AIA Resubmission Utility for recovery. We can see how the above recovery mechanisms offer different options when designing a typical integration flow. Without going deep into the details, the figure below shows at a high level the design used for delivering messages to 3 end systems using a topic and 3 topic consumers. The BPEL ABCS components consume the canonical message, convert it to the respective application-specific formats and deliver it to the end systems. The requirement was to guarantee delivery of the message to each of the 3 systems within the flow, which the design achieves under normal circumstances.

However, issues were observed at run-time in failure cases. When message delivery fails for one of the systems, e.g. System B, the design causes a rollback of the message to the previous milestone, which in this case is the topic. The rolled-back message residing in the error destination is then redelivered to the topic. The message is picked up for processing again by all 3 topic consumers, causing duplicate messages to be delivered to Systems A and C.

This issue can be addressed in a few ways:

1) Introduce an intermediate milestone after the message is consumed off the topic. For instance, we could introduce a queue to hold the converted messages (indicated by point 1 in the figure).

2) Use separate queues instead of a topic to hold the canonical messages.

In either case, when a failure occurs only the message in the failed branch would have to be recovered using AIA Message Resubmission, as seen in the section above.

However, both these options introduce additional queues which need to be maintained by the operations team. Also, if in future an additional end system were to be introduced, it would necessitate adding new queues in addition to new JMS consumers and ABCS components.

3) Introduce a transaction boundary: this can be done by changing the BPEL ABCS component to use an asynchronous one-way delivery policy. In this case, any failures cause the message to roll back not to the topic but to the internal BPEL Delivery Store (indicated by point 2 in the figure). These messages can then be recovered manually using BPEL Message Recovery, as we saw in the first section above. The recovery is limited only to the faulted branch of the integration.

4) Another option is to employ fault policies. We can attach a fault policy to the BPEL ABCS component; the policy invokes the human intervention action for faults encountered during the end-system invoke. The message can then be manually recovered from the EM FMWC Console, as seen in the SOA Fault Recovery section above. This applies only to the faulted branches and hence avoids duplicate message delivery to the other end systems.

Another issue observed was that the end systems would lose messages that arrived while the consumers were offline. This problem can be addressed by configuring durable subscriptions for the topic consumers. In the absence of durable subscriptions, a topic discards a message once it has been delivered to all the active subscribers. With durable subscribers, the message is retained until it has been delivered to all the registered durable subscribers, hence guaranteeing message delivery. Refer to the Adapters User's Guide here - http://docs.oracle.com/cd/E28280_01/integration.1111/e10231/adptr_jms.htm for details on configuring durable subscriptions for topic consumers.
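The difference in behavior can be sketched with a toy model. This is not the JMS API; the class and field names are purely illustrative of the retain-while-offline semantics described above:

```python
# Toy model of topic delivery: a message published while a non-durable
# subscriber is offline is lost for that subscriber, while a durable
# subscription retains it until the subscriber reconnects.
class Topic:
    def __init__(self):
        self.subscribers = {}  # name -> subscription state

    def subscribe(self, name, durable):
        self.subscribers[name] = {"online": True, "durable": durable,
                                  "pending": [], "received": []}

    def set_online(self, name, online):
        sub = self.subscribers[name]
        sub["online"] = online
        if online:  # a reconnecting durable subscriber drains retained messages
            sub["received"].extend(sub["pending"])
            sub["pending"].clear()

    def publish(self, msg):
        for sub in self.subscribers.values():
            if sub["online"]:
                sub["received"].append(msg)
            elif sub["durable"]:
                sub["pending"].append(msg)  # retained until redelivery
            # non-durable + offline: the message is simply lost for this subscriber

t = Topic()
t.subscribe("A", durable=False)
t.subscribe("B", durable=True)
t.set_online("A", False)
t.set_online("B", False)
t.publish("order-1")                   # both consumers are offline
t.set_online("A", True)
t.set_online("B", True)
print(t.subscribers["A"]["received"])  # [] - message lost
print(t.subscribers["B"]["received"])  # ['order-1'] - retained and redelivered
```

This is why the durable-subscription fix closes the message-loss window without any change to the publisher.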

Summary

The table below summarizes the different manual recovery mechanisms that we have seen and their main characteristics.

In this blog, we have seen the various Manual Fault Recovery mechanisms provided by SOA Suite and AIA 11g versions. We have also seen the security requirements for the Admin user to perform recovery and the variety of options available to handle the faults. This knowledge should enable us to design and administer robust integrations which have definite points of recovery. -Shreeni

Effect of Queue and JCA Settings on Message Retry by JMS Adapter


Introduction

This blog is intended to share some knowledge about the effects of Queue Level Redelivery Settings and Adapter level Retry Settings on message processing by JMS Adapter.  It is also intended to provide some useful insights that help in designing retry mechanisms into an integration system. Specifically, this blog illustrates the Retry behavior of JMS Adapter and how it is impacted by these settings.

Detail

Consider an integration system that uses the JMS Adapter to consume messages from a queue and deliver them to an end system after BPEL processing. The figure below depicts such a simple system. Note that the source queue is configured with an error queue to hold failed messages.

Adapter level Retry Settings

First, let’s look at the Adapter level Retry Settings that are available to configure on the JMS Adapter Consumer service. The settings below are typically used to configure the retry behavior of inbound JMS Adapter consumers:
  1. Jca.retry.count --> example value 3
  2. Jca.retry.interval --> example value 2
  3. Jca.retry.backoff --> example value 2
Assume that the end system is down. During such an error condition, when the message cannot be successfully processed and delivered to the end system, the JMS Adapter retries the processing of the failed message using the above retry settings. For the example values above, the adapter retries a failed message 2, 6 and 14 seconds after the time of first failure (the delay between attempts doubles each time because of the backoff factor). Now, assume the end system is still down after the 3 retries. The expectation in most integration flows is that the message rolls back to the source queue and can be found in the error destination. This helps in manually recovering the failed message after the error condition is resolved.
However, under certain conditions, the JMS Adapter can reject a failed message after exhausting the configured number of retries. When this happens, the message is no longer available at the source queue for recovery. The rejected messages are handled by the Adapter Rejection Handler. Refer here for details on rejection handlers for Adapters.
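The backoff schedule implied by the jca.retry.* settings can be sketched as follows (the function name is illustrative; the arithmetic matches the 2/6/14-second example above):

```python
# Sketch of the exponential backoff implied by jca.retry.count,
# jca.retry.interval and jca.retry.backoff. For count=3, interval=2,
# backoff=2, the delay between attempts doubles each time (2s, 4s, 8s),
# so retries fire 2, 6 and 14 seconds after the first failure.
def retry_offsets(count: int, interval: float, backoff: float) -> list:
    """Seconds after the first failure at which each retry is attempted."""
    offsets, elapsed, delay = [], 0.0, float(interval)
    for _ in range(count):
        elapsed += delay
        offsets.append(elapsed)
        delay *= backoff
    return offsets

print(retry_offsets(3, 2, 2))  # [2.0, 6.0, 14.0]
```

Note how quickly the total retry window grows with the backoff factor; this matters when sizing queue redelivery delays to match.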

Queue Level Redelivery Settings

At this point, let us look at Queue Level Redelivery Settings. When redelivery is set at the queue level, any messages that fail to process and which are rolled back to the queue will be redelivered for processing. If the number of failures for the message exceeds the redelivery count, the message is redirected to the error destination.
All messaging providers support some form of queue redelivery settings. For instance, WebLogic JMS has the Redelivery Limit setting, and AQ JMS provides the same via the max_retries setting of a queue.

| Messaging Provider | Redelivery settings | Other related settings                                |
|--------------------|---------------------|-------------------------------------------------------|
| WebLogic JMS       | Redelivery Limit    | Expiration Policy=Redirect, Redelivery Delay Override |
| AQ JMS             | max_retries         | retry_delay                                           |

Note that Weblogic JMS can be configured to discard or log the failed messages instead of redirecting to an error destination.

Failure to Rollback

Under what conditions does the JMS Adapter reject messages that are submitted to it by the queue for reprocessing? When the queue redelivers a message to the adapter more times than the adapter can retry, the adapter rejects the message. Hence, the condition below will ensure that the message properly rolls back to the source queue rather than being rejected by the adapter.

Number of Redeliveries by the Queue <=  Retry Count of Adapter Service

Note that when Jca.retry.count is not set at the adapter service level, the GlobalInboundJcaRetryCount setting takes effect. The default value of GlobalInboundJcaRetryCount is -1, which implies an infinite number of retries. Refer to the Adapter’s Guide section here for more information on setting the retry properties.
The table below lists some sample values of the settings and the behavior observed after repeated failures:

| Jca.retry.count | GlobalInboundJcaRetryCount | Queue Redelivery | Behavior after repeated failure    |
|-----------------|----------------------------|------------------|------------------------------------|
| 3               | -1                         | 5                | Message rejected by adapter        |
| 6               | -1                         | 5                | Message rolled back to error queue |
| 0               | -1                         | 0                | Message rolled back to error queue |
| Not set         | 5                          | 5                | Message rolled back to error queue |
| Not set         | 0                          | 2                | Message rejected by adapter        |

 
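The decision rule behind these rows can be sketched as a predicate. This is a simplified model of the condition stated above, not adapter source code; the function and parameter names are illustrative:

```python
# Sketch of the rollback-vs-reject condition: a failed message ends up in the
# error destination only if the queue does not redeliver it more times than
# the adapter is willing to retry. jca_retry_count=None models "Not set", in
# which case GlobalInboundJcaRetryCount applies (-1 meaning infinite retries).
def failure_outcome(jca_retry_count, global_retry_count, queue_redelivery):
    effective = (jca_retry_count if jca_retry_count is not None
                 else global_retry_count)
    if effective == -1 or queue_redelivery <= effective:
        # Adapter can absorb every redelivery; the queue's limit is hit first
        # and the message is redirected to the error destination.
        return "rolled back to error queue"
    return "rejected by adapter"

print(failure_outcome(3, -1, 5))  # rejected by adapter
```

Running the predicate over the sample values reproduces each row of the table above.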

Summary

Incorrect queue and adapter settings can lead to undesired behavior in the recovery of messages during failure conditions. We have seen a few such situations in this blog. With proper settings, we can design integration systems that exhibit consistent error handling and recovery behavior.


-Shreeni


Improving Performance via Parallelism in Oracle Event Processing Pipelines with High-Availability



This posting explains how to use parallelism to improve the performance of Oracle Event Processing (OEP) applications in active-active high-availability (HA) deployments. Parallelism is exploited for performance gain in each of the server instances of an HA configuration. This is achieved by identifying sections of an application's processing pipeline that can operate in parallel and can therefore be mapped to separate processing threads. Both pipeline parallelism and independent query parallelism are described.

Pipeline Parallelism

A pipeline architecture has inherent concurrency because each of its stages works in parallel on different data elements flowing through it. For example, in the pipeline in figure 1, if each stage is assigned its own processing thread, the following actions can occur concurrently: the input JMS adapter reads event #3 from a JMS topic, the CQL query processor handles event #2, and the output JMS adapter writes event #1 to a queue.
Figure 1. OEP Pipeline with three concurrent stages
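The one-thread-per-stage concurrency of figure 1 can be sketched as a toy pipeline (this is a generic illustration in Python, not OEP code; the stage bodies are placeholders for the adapter and query work):

```python
# Three pipeline stages, one worker thread each, coupled by bounded queues
# (playing the role of EPN channels). With a single thread per stage, each
# stage preserves FIFO order, so events leave the pipeline in exactly the
# order they entered - the property the HA constraint requires.
import queue
import threading

def stage(inbox, outbox, work):
    while True:
        event = inbox.get()
        if event is None:          # poison pill: shut the stage down
            outbox.put(None)
            return
        outbox.put(work(event))

in_ch, mid_ch, out_ch, done = (queue.Queue(maxsize=1000) for _ in range(4))
threads = [
    threading.Thread(target=stage, args=(in_ch, mid_ch, lambda e: e)),   # input adapter
    threading.Thread(target=stage, args=(mid_ch, out_ch, lambda e: e)),  # CQL processor
    threading.Thread(target=stage, args=(out_ch, done, lambda e: e)),    # output adapter
]
for t in threads:
    t.start()
for i in range(5):
    in_ch.put(i)
in_ch.put(None)

results = []
while (e := done.get()) is not None:
    results.append(e)
for t in threads:
    t.join()
print(results)  # [0, 1, 2, 3, 4]
```

The gain comes from the stages overlapping in time, while the single worker per stage keeps ordering deterministic; adding a second worker to any stage would break that guarantee, which is exactly the constraint discussed next.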

Although OEP HA pipelines are limited to one thread per stage, significant performance gains can be achieved by running each stage in a separate thread, compared to running all stages on one thread or on fewer threads than there are pipeline stages. A key constraint in OEP when using active-active HA (see Oracle Fusion Middleware Developer's Guide for Oracle Event Processing 11g Release 1 (11.1.1.7) for Eclipse, section 24) is that the input streams to both the primary and the secondary instances must be identical and must maintain the same event ordering as events flow through the OEP Event Processing Network (EPN). This constraint limits the EPN topology to either a linear pipeline, starting from an input adapter and ending with an output adapter, or a tree where each node with downstream branching replicates every event to each of its branches.

The event ordering requirement also limits to one the number of threads assigned to each stage of the EPN. Having more than one thread in one stage, for example in an input JMS adapter, would fail to assure that the order of events entering the following stage, such as the input channel, is the same in both the primary and secondary instances. The reason event order cannot be assured when using multiple threads on a pipeline stage is that a pipeline stage operates as a queue with multiple worker threads serving it. Since the execution time for each event and the thread scheduling order cannot be kept in complete alignment across the primary and secondary server instances, the order of events passed to the following pipeline stages could end up out of order across HA instances. The mechanism recommended in OEP best practices for assuring that the primary and secondary instances of an HA configuration receive identical input streams is a JMS topic.
Since the processing speeds of consecutive stages can vary, buffers are used to couple stages and hold the output of one stage until the following stage can consume it. In OEP these inter-stage buffers are the EPN channels. In addition to operating as buffers, EPN channels are also the configuration mechanism for specifying the number of processing threads assigned to the stage following the channel. A channel's buffer length and the number of threads assigned to its following stage are defined within the corresponding channel element in the EPN's META-INF/wlevs/config.xml file by assigning values to the max-size and max-threads parameters. For example, the inputChannel element in the pipeline in figure 1 is configured as follows:
<channel>
    <name>inputChannel</name>
    <max-size>1000</max-size>
    <max-threads>1</max-threads>
</channel>
For input adapters, which don't have a preceding channel, the thread assignment is done by setting the concurrentConsumers property to one in the corresponding JMS input adapter element in the META-INF/spring/MonitoracaoTransacao.xml file:
<wlevs:adapter id="jmsInputAdapter" provider="jms-inbound">
    <wlevs:listener ref="inputChannel" />
    <wlevs:instance-property name="converterBean" ref="jmsMessageConverter" />
    <wlevs:instance-property name="concurrentConsumers" value="1" />	
    <wlevs:instance-property name="sessionTransacted" value="false" />
</wlevs:adapter>

Query Parallelism

Query parallelism refers to processing stages where there are multiple independent queries applied simultaneously to each event of the input stream. This is achieved by having a channel with multiple downstream elements, where each event flowing through the channel is broadcast to all of the channel’s downstream elements. This is illustrated in figure 2, where the input channel has five downstream processors, each one running a concurrent query. As explained above, the topology resulting from this type of scenario is a pipeline tree, as opposed to a linear pipeline topology.  
Figure 2. Query parallelism
 


Each of the five concurrent queries in figure 2 is configured to be independent of the others because each one consumes a separate copy of every event that flows out of the input channel. This configuration forks the single pipeline of the JMS input adapter followed by the input channel into five independent pipelines, each comprising a CQL query processor followed by an output channel and an HA and JMS output adapter pair. To increase performance, each of these forked pipelines can be treated as an independent linear pipeline whose stages can be parallelized. In the example in figure 2, on each of the branch pipelines, the CQL query processor is assigned one thread, and the HA and JMS output adapter pair is also assigned one thread. Thread assignment for the CQL query processor stage is defined in the input channel configuration element in META-INF/wlevs/config.xml by setting the max-threads property to 5 as follows:
<channel>
    <name>inputChannel</name>
    <max-size>1000</max-size>
    <max-threads>5</max-threads>
</channel>
max-threads should not be larger than the number of processors fanning out from the input channel. This configures a pool of threads capable of handling one event simultaneously on each of the forked pipelines. The remaining stages in each of the pipelines are assigned one thread as in the single linear pipeline case. In summary, even though HA OEP configurations have strong event ordering requirements that prevent parallelism within each stage, there is still end-to-end pipeline concurrency that can be effectively exploited by assigning at most one thread to each element on each of the linear pipelines in an OEP EPN.

Code Coverage for BPMN


Introduction

I visited a customer recently who asked a very interesting question.... they'd been performing a series of stress tests of their BPM project, made up of many complex BPM processes, and they wanted to know if there were any activities/paths in any of their processes which they hadn't traversed... sort of like "Clover" for BPM. This led me to think about BPM auditing and cross-referencing this with the BPM activities.

BPMN Code Coverage: The Theory

Let us take a look at the relevant tables in the SOAINFRA schema....

BPM_AUDIT_QUERY

Providing that the audit level has been set sufficiently high (for example "Production" would do), this table stores details of all BPMN activities instantiated at any given time.

BPM_CUBE_ACTIVITY & BPM_CUBE_PROCESS

These tables provide a static view of all activities in all deployed processes at any given time.

Deployed BPM activities not in BPM_AUDIT_QUERY

It became obvious that selecting all activities in the join of BPM_CUBE_ACTIVITY and BPM_CUBE_PROCESS for a given deployed process/composite which do not appear in BPM_AUDIT_QUERY during a given time period would highlight activities not invoked as part of our testing. As a result I ended up with a piece of SQL thus.... CBPM_09 ...i.e. which activities in processes "BpmClover" and "BpmCallable" were not traversed in the last 24 hours.
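Since the SQL itself is shown only as a screenshot (CBPM_09), a rough sketch of the idea is reproduced below; the column names and join keys are assumptions based on the PS6 schema and may differ between releases:

```sql
-- Sketch only: activities deployed but never audited in the last 24 hours.
-- Table joins and column names are assumptions and may vary by release.
SELECT cp.process_name, ca.name AS activity_name
FROM   bpm_cube_process  cp
JOIN   bpm_cube_activity ca ON ca.processid = cp.processid
WHERE  cp.process_name IN ('BpmClover', 'BpmCallable')
AND    NOT EXISTS (
         SELECT 1
         FROM   bpm_audit_query aq
         WHERE  aq.activity_name = ca.name
         AND    aq.audit_date    > systimestamp - INTERVAL '24' HOUR)
ORDER  BY cp.process_name, ca.name;
```

The NOT EXISTS anti-join is what gives us "deployed but not traversed"; the date predicate restricts the audit rows to the test window.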

BPMN Code Coverage: The Practice

I needed a fairly simple process to test with, not too complex but with a good selection of activities, human tasks, boundary events, gateways etc... and ended up with the following (not BPMN best practices by any means).... CBPM_01 ...i.e. a main process and a callable sub-process.

Test the Conditional Gateway

The associated flow trace.... CBPM_02 ...and the query result.... CBPM_03 ...still a lot of activities not run.

Test the Default Gateway & Complete the User Activity

The associated flow trace.... CBPM_04 ...and the query result.... CBPM_05 ...so the callable sub-process has been run in full and the only things left to cover are the "boundary timer event" and associated "end".

Test the Boundary Event

If we let the human activity time-out we get the associated flow trace.... CBPM_06 ...and the query result.... CBPM_07 ...i.e. nothing left to run.

Summary

There is one big caveat with this solution... we are querying the underlying SOAINFRA tables and these could change at any time with a new release of the product.... this testing was carried out against PS6. It's also worth noting that the query does not take the composite version into account, though this would be easy to add. This is, however, a quick and simple way of determining any gaps in testing; it would be very easy to create a PL/SQL procedure from the SQL and parameterize the date range and process names to create a very flexible function.

BPM Process Instances – Faults, Rollback & Recovery – Part 1


Introduction

This is part 1 of a 4 part blog explaining how the BPM engine functions under the covers when "faults" occur, be they unhandled technical faults or failures at the engine level.
  • Part 1 - will set the scene by explaining timeouts and their values & fault handling
  • Part 2 - will explain how the BPM engine handles messages, threads & transactions
  • Part 3 - will explain how & when the BPM engine rolls back transactions
  • Part 4 - will show how BPM messages can be recovered after a rolled back transaction

Part 1: Setting the Scene - BPM Engine Timeouts & Fault Handling

The BPM engine by its very nature will contain many long running process instances and it is essential that BPM project & operational teams understand how instances are handled inside the engine, how faults can be handled, how & why transaction rollbacks happen and how instances can be recovered. Within Oracle Support, the A-Team, PTS and PM, we frequently hear of customers who have process instances that are “stuck”, or who have had a server failure and wonder where their instances have gone. In this document we will try to understand what has happened and how to recover cleanly.

BPM Engine Timeouts

One of the most important concepts to understand with the BPM engine is the level at which timeouts can occur.....

Global Java Transaction Timeout – JTA

This is the broadest level of timeout inside SOA, the Java transaction timeout. It can be set in the WebLogic Server Administration Console in the relevant domain.... BPMR_01 If timeouts occur at the JTA level they cannot be caught by the BPM process, either as a “catch” activity within a process or as part of an overall fault policy; however, the instances will roll back to the last dehydration point (see later).

BPM EJB Timeout

The BPM engine itself uses a number of EJBs to control threads; these also have timeout values which should be set, and they can be found in the WebLogic Server Administration Console under the soa-infra deployment.... BPMR_02 BPMR_03 ...note that at the time of writing (PS6) BPMNActivityManagerBean did not have a timeout property; it will be necessary to apply a patch to set this value. As with the JTA timeout, any timeouts at the EJB level cannot be caught in the process or by a fault policy, but instances will roll back to the previous dehydration point.

Resource Timeout

This is the most local level of timeout, i.e. a call to a database times out, a call to a web service times out. The timeout value can be set on the resource itself in the Enterprise Manager Console, e.g. for a database adapter in a composite.... BPMR_04 BPMR_05 ...note that if the property does not appear it can be added as follows.... BPMR_06

Setting Timeout Values

The general rule of thumb that should be followed is.... JTA Timeout > BPM EJB Timeout > Resource Timeout ....following this will ensure that timeouts can be handled at the local level, i.e. caught by a “catch” activity within the process or by a fault policy.

Fault Handling

This topic is covered in great detail both within the official Oracle SOA Suite documentation and in numerous blog entries elsewhere so there will only be a cursory overview here. As a general guideline technical faults, such as a remote exception, should be caught by an appropriate policy in the fault policy framework and business faults, such as “no account found”, should be caught within the process itself either as a boundary catch activity or a process-level catch activity. In either case the actions following a caught fault will probably follow a pattern similar to... retry “x” times with a “y” backoff, and if this still fails, direct to manual intervention. In the case of the fault policy framework, this will result in the instance being recoverable in the Enterprise Manager, and in the case of a catch within the process itself, a redirect to a manual activity. It is worth noting that in both these cases it will be possible to manipulate the message data itself. Also worth noting is the Alter Flow functionality inside Oracle BPM which allows business users to reposition the currently active business activity within the process instance itself and also to manipulate the instance data. This can be particularly useful in situations where the instance is in "suspended" state, possibly due to a "selection failure" caused by unassigned xml elements in the payload... in this case "Alter Flow" can be the only option for recovery... this is not covered as part of this series of blogs.

Summary

In this first part in the series we have covered some groundwork necessary for understanding BPM engine faults, rollback & recovery, primarily the various timeout values and the role of fault handling. In the next part we will take some typical BPM process patterns and show how the BPM engine handles messages, threads and transactions.

BPM Process Instances – Faults, Rollback & Recovery – Part 2


Introduction

This is part 2 of a 4 part blog explaining how the BPM engine functions under the covers when "faults" occur, be they unhandled technical faults or failures at the engine level. Part 1 can be found here.

Part 2: Understanding BPM Messages, Threads & Transactions

Given SOA Suite & BPM’s ability to control timeouts and to handle faults, why do we need to understand the BPM engine any further? Well, there will always be exceptional circumstances, such as runtime errors in the engine itself (e.g. NullPointerExceptions) caused by internal or external events (out of memory, stuck threads etc.), and alongside this is the likelihood that, however thorough the testing carried out, there will be some unforeseen scenarios that have not been handled appropriately in the design. In both of these circumstances the affected instances will need to be recovered somehow, and we’ll now look at how the BPM engine handles threads and how failed instances can be recovered. As an introduction to this topic I would recommend the excellent presentation by David Read (BPM PM) found here.

Local Optimization

It is important to understand the concept of “local optimization” in SOA Suite... essentially this is enabled by default and means that any service calls which remain within the same WebLogic cluster will be optimized to be routed to the same WebLogic managed server instance, i.e. there will be no routing out via the load balancer and no HTTP/SOAP; all calls will be optimized to be Java calls and will therefore use the same thread as the client.

BPM Process Patterns

In order to better demonstrate messages & threading inside the BPM engine we will use the following three common process interaction patterns....

Pattern 1 – Async - Async

This is a very common pattern within BPM, an asynchronous process calls an asynchronous process via a send/receive activity.... BPMR_07

Pattern 2 – Async – Sync

Another common pattern, normally an asynchronous BPM process calling a synchronous SOA service such as a mediator composite in this case.... BPMR_08

Pattern 3 – Async with Acknowledgement – Sync

The final common pattern, again an asynchronous client process calling a synchronous SOA service, but this time the client sends an acknowledgement back to its caller immediately after being invoked in order to notify it that it will continue to process asynchronously.... note that in order to effectively respond with the acknowledgement, the process is generally designed with a timer activity to force dehydration and therefore a commit point; otherwise the acknowledgement is not sent back to the client until a further dehydration occurs.... BPMR_09

BPM Process Patterns – Messages & Threads

Now we have seen the patterns we can see how the messages & threads are handled for these within the BPM engine. First let us understand the basics of BPM messages and threads....

BPM Messages

There are essentially two kinds of messages within the BPM engine, “invoke messages” and “callback messages”.
  • Invoke messages – are what drive a process; an invoke message instantiates an invoker thread which then handles the process instance until a dehydration point, i.e. a “Wait” activity, a “Timer” activity or a “Receive” activity. The invoker thread can be an engine thread or, in the case of a synchronous call from a client, the client thread itself.
  • Callback messages – are messages that asynchronously arrive back in the BPM engine and must be correlated to a running instance; examples are callbacks from the workflow service when a human task has been acted upon, or callbacks from an asynchronous service which has completed.

DLV_MESSAGE, DLV_SUBSCRIPTION & WORK_ITEM Tables

Both kinds of message are stored in the SOAINFRA table DLV_MESSAGE, which acts as a message tracker for BPM processes and will be the starting point for any recovery scenario we cover later. In the case of “callback” messages a row is also written to the DLV_SUBSCRIPTION table and used to correlate the incoming callback message to the process instance. Note that the actual payload of the message is not stored here, it is stored in the XML_DOCUMENT table. BPMR_10 The important columns here are “CONV_ID”, a unique identifier for the message, “CONV_TYPE”, which identifies whether this is an invoke or callback message, and “STATE”, which identifies whether the message itself has been handled. Also of importance is the WORK_ITEM table which contains information & state for certain BPMN activities; we are interested in this because of timer activities. BPMR_11 The states we are interested in on these tables are as follows.... BPMR_12 BPMR_13 BPMR_14
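To make the table descriptions concrete, a simple inspection query might look like the following sketch; the DLV_TYPE decode (1 = invoke, 2 = callback) matches the message types described above, while RECEIVE_DATE is an assumption for the message arrival column:

```sql
-- Sketch: list recent BPM engine messages with a readable type column.
-- Column names are assumptions against the PS6 SOAINFRA schema.
SELECT conv_id,
       DECODE(dlv_type, 1, 'INVOKE', 2, 'CALLBACK') AS message_type,
       state,
       receive_date
FROM   dlv_message
ORDER  BY receive_date DESC;
```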

Pattern 1 – Async - Async

Let’s look at how this knowledge of messages can be applied to the patterns we’ve seen, starting with the standard async-async.... BPMR_15
  • TX1 – Transaction 1 from the client inserts a message into DLV_MESSAGE of type “INVOKE” and state “UNDELIVERED”
  • TX2 – Transaction 2 updates the message to state “DELIVERED” and continues until the dehydration point at the “receive” activity, via inserting another message into DLV_MESSAGE of type “INVOKE” for the async service process.
  • TX3 – Transaction 3 inserts a new message into WORK_ITEM of type “RECEIVE” and state “PENDING” and a message into DLV_SUBSCRIPTION with a state of “UNRESOLVED”
  • TX4 – Transaction 4 updates the DLV_MESSAGE to “DELIVERED” and inserts a WORK_ITEM for the timer with state “PENDING”
  • TX5 – Transaction 5 updates the WORK_ITEM message with state “CLOSED” and continues till the end of the called process.
  • TX6 – Transaction 6 updates the WORK_ITEM message for the main process to state “CLOSED”, sets the state of the DLV_SUBSCRIPTION table to “HANDLED” and continues till the end of the main process.
We can test this out to see what happens in the relevant tables by giving the “gotoSleep” activity a large value such as 4 minutes and running a test.
During Testing
This is what we see during the test while “gotoSleep” is active.... DLV_MESSAGE BPMR_16 ...i.e. an INVOKE message for each of the “start” activities in the two processes in state “STATE_HANDLED” WORK_ITEM BPMR_17 ...i.e. a WORK_ITEM for the “Receive” activity in the client process of state “open_pending_complete” and a similar WORK_ITEM for the “gotoSleep” activity with a state “open_pending_complete” DLV_SUBSCRIPTION BPMR_18 ...i.e. a subscription of state UNRESOLVED for the called service.
After Test Completion
This is what we see once the test has successfully completed.... DLV_MESSAGE BPMR_19 ...i.e. an extra entry for the “end” activity  of type “DLV_MESSAGE” and state “STATE_HANDLED” WORK_ITEM BPMR_20 ...i.e. both WORK_ITEM entries are now in state “CLOSED_FINALIZED” DLV_SUBSCRIPTION BPMR_21 ...i.e. the subscription is now in state HANDLED

Pattern 2 – Async – Sync

BPMR_22
  • TX1 – Transaction 1 from the client inserts a message into DLV_MESSAGE of type “INVOKE” and state “UNDELIVERED”
  • TX2 – Transaction 2 updates the message to state “DELIVERED” and continues until the end of the process.
Note how “local optimization” affects the threading here... the same Java transaction (TX2) is used for everything from the “Start” in the BPM process, through the mediator and DB adapter, and back to the “End” activity in the BPM process. We can test this out to see what happens in the relevant tables by setting the DBSleep stored procedure to sleep for an appropriate amount of time and running a test.
During Testing
This is what we see during the test while the DBSleep stored procedure is sleeping.... DLV_MESSAGE BPMR_23 ...i.e. an INVOKE message for the “start” activity in the process in state “STATE_UNRESOLVED” Nothing in WORK_ITEM or DLV_SUBSCRIPTION as expected.
After Test Completion
This is what we see once the test has successfully completed.... DLV_MESSAGE BPMR_24 ...i.e. the INVOKE message now has state “STATE_HANDLED”

Pattern 3 – Async with Acknowledgement – Sync

BPMR_25
  • TX1 – Transaction 1 from the client inserts a message into DLV_MESSAGE of type “INVOKE” and state “UNDELIVERED”
  • TX2 – Transaction 2 updates the message to state “DELIVERED” and continues until the dehydration point at the “timer” activity.
  • TX3 – Transaction 3 inserts a new message into WORK_ITEM with state “3 – OPEN_PENDING_COMPLETE”
  • TX4 – Transaction 4 updates the WORK_ITEM row with state “6 – CLOSED_FINALIZED” and continues till the end of the process.
Note how “local optimization” affects the threading here... the same Java transaction (TX4) is used for everything from the “CatchEvent” in the BPM process, through the mediator and DB adapter, and back to the “End” activity in the BPM process. We can test this out to see what happens in the relevant tables by setting the DBSleep stored procedure to sleep for an appropriate amount of time and running a test.
During Testing
This is what we see during the test while the DBSleep stored procedure is sleeping.... DLV_MESSAGE BPMR_26 ...i.e. an INVOKE message for the “start” activity in state “STATE_HANDLED” WORK_ITEM BPMR_27 ...i.e. a WORK_ITEM for the “CatchEvent” timer activity in the client process of state “open_pending_complete”
After Test Completion
This is what we see once the test has successfully completed.... DLV_MESSAGE BPMR_28 ...i.e. no difference. WORK_ITEM BPMR_29 ...i.e. the WORK_ITEM entry is now in state “CLOSED_FINALIZED”

Summary

In the second part in the series we have looked at some typical process patterns, the important tables in SOAINFRA and what data is added to these tables as a process instance starts and moves to completion. In the next part we will look at what happens when an uncaught exception occurs and the instances roll back.

BPM Process Instances – Faults, Rollback & Recovery – Part 3


Introduction

This is part 3 of a 4 part blog explaining how the BPM engine functions under the covers when "faults" occur, be they unhandled technical faults or failures at the engine level. Part 1 can be found here.

Part 3: Understanding BPM Messages Rollback

Now that we’ve seen how the important SOAINFRA tables are used by the engine we can look at how unhandled exceptions are rolled back by the engine to the last dehydration point. Remember, using appropriate fault policies and catch activities with BPM should avoid the vast majority of rollbacks but as mentioned previously, these can still occur.

Pattern 1 – Async – Async

We can assume here that no fault policy framework exists and that no process-level fault handling exists either. Given that this scenario is completely asynchronous we are not going to be able to generate any timeouts; we’ll have to simulate a failure at the engine level by introducing a failing script activity (map string to number) into the process.... BPMR_30 BPMR_31 ...now running the test we have a failure.... BPMR_32 Remember we haven’t caught this anywhere, so how can we recover this instance? Let’s take a look at the relevant tables.... DLV_MESSAGE BPMR_33 ...i.e. an INVOKE message for the “start” activity in the client process of state “STATE_HANDLED” and an INVOKE message for the “start” activity of the called process in state “STATE_UNRESOLVED” WORK_ITEM BPMR_34 ...i.e. a WORK_ITEM for the “Receive” activity in the client process of state “open_pending_complete” DLV_SUBSCRIPTION BPMR_35 ...i.e. a subscription of state UNRESOLVED for the called service. So we can infer from the DLV_MESSAGE table that the instance has been rolled back to the last dehydration point, i.e. the “start” activity in the called service, so we should be able to “recover” from here.... BPMR_36

Pattern 2 – Async – Sync

For this scenario we will force an EJB timeout by setting the value of the DBAdapter timeout to 400 seconds (greater than the EJB timeout of 300 seconds).... BPMR_37 ...and we’ll set the stored procedure to run for 500 seconds. When we run the test we see the following in the flow trace.... BPMR_38 ...here we see that there has been a “TransactionRolledBackLocalException”, that the BPM process “TestDBTimeout” faulted after around 300 seconds (the EJB timeout) and that the DBAdapter itself timed out over a minute later. So where does this leave the underlying tables? DLV_MESSAGE BPMR_39 ...i.e. an INVOKE message for the “start” activity in the client process in state “STATE_UNRESOLVED”. We can infer from the DLV_MESSAGE table that the instance has been rolled back to the last dehydration point, i.e. the “start” activity in the client process, so we should be able to “recover” from here.... BPMR_40

Pattern 3 – Async with Acknowledgement – Sync

Again we will force an EJB timeout by setting the value of the DBAdapter timeout to 400 seconds (greater than the EJB timeout of 300 seconds).... BPMR_41 ...the results are pretty similar to the previous pattern except this time the calling process remains in a “Running” state. So where does this leave the underlying tables? DLV_MESSAGE BPMR_42 ...i.e. an INVOKE message for the “start” activity in the client process in state “STATE_HANDLED”. WORK_ITEM BPMR_43 ...i.e. a WORK_ITEM for the “CatchEvent” timer activity in the client process of state “open_pending_complete”. We can infer from the above tables that the instance has been rolled back to the timer activity in the client process... BPMR_44

Summary

In the third part in the series we have looked at what happens when a process instances rolls back & what we see in the appropriate SOAINFRA tables. In the next part we will look at how we can recover these rolled back process instances.

BPM Process Instances – Faults, Rollback & Recovery – Part 4


Introduction

This is part 4 of a 4 part blog explaining how the BPM engine functions under the covers when "faults" occur, be they unhandled technical faults or failures at the engine level. Part 1 can be found here.

Part 4: BPM Message Recovery

Idempotence

It is vitally important to understand the concept of idempotence, i.e. the ability to replay activities more than once without any adverse impact. As an example, an activity to credit money to a bank account would not be idempotent whereas an activity to query the money on a bank account would be. This is important since recovering process instances necessarily means that some activities will be replayed. It is a business decision as to whether this recovery of instances is valid or not.

Recover Individual Instances

As a first step, let’s look at how to recover the individual instances that have failed.

Pattern 1 – Async – Async

Via the Enterprise Manager Console we can view the BPMN engine.... BPMR_45 ...and from the “Recovery” tab we can see any uncompleted “Invoke” activities (excluding the last 5 minutes as this is an active table).... BPMR_46 ...and there we see the failed instance which we can recover.

Pattern 2 – Async – Sync

Via the Enterprise Manager Console as before... BPMR_47

Pattern 3 – Async with Acknowledgement – Sync

Via the Enterprise Manager Console as before... BPMR_48 ...we can’t see the latest instance since it was not rolled back to an invoke, but we can see the actual activity itself; this however is not recoverable.... BPMR_49 We know that it has rolled back to the timer activity and we can recover it by simply clicking the “Refresh Alarm Table” button shown; note that this will refresh all timers, it is a bulk operation. Note this button is only available in PS6 and later versions.

Bulk Recovery of Instances

Now that we have seen how to find and recover individual instances which have failed with various patterns, let’s look at how we can query and recover in bulk. It could be the case that a catastrophic failure has caused managed servers to crash, resulting in potentially thousands of failed process instances stopped at many different activities within them. How do we find all “stuck” instances? Which ones will recover automatically and which will have to be manually recovered?

Automatic Recovery – Undelivered Messages

By default BPM instances are not recovered automatically on managed server restart, as opposed to BPEL instances which are. This can be verified, or changed if required, in the Enterprise Manager Console (remember idempotence!).... BPMR_50 BPMR_51 ...i.e. on startup, recover all instances during a window of 0 seconds.

Automatic Recovery – Failed Timers

In contrast, on server restart all timers will re-fire, i.e. in “Pattern 3”, on server restart the “catchEvent” timer activity will fire again. Also worth noting is that any timers which expired whilst the managed server was down will also fire on restart... this could cause a large spike in activity on restart if multiple instances with expired timers are retried. Note also exactly what this “refresh” does – when a WORK_ITEM entry is created for a timer, be it a simple “sleep”.... BPMR_52 ...or a boundary timer on an activity.... BPMR_53 ...an in-memory timer is created, scheduled from the transaction commit listener; when the in-memory timer expires the WORK_ITEM table is updated. A “refresh” will re-invoke the active entries in the WORK_ITEM table, thus creating new instances of those in-memory timers; it will not however reset these timers to "zero", i.e. begin the time period again.

Recovery Queries

The above scenarios have covered some common patterns and the message recovery associated with them. The provided scripts cover all possible “stuck instance” scenarios: how to find the instances and how to recover them in bulk. It is advisable to agree on a fixed point in time for the recovery window. This will ensure that, when you run the various queries we are about to describe, you will get a consistent set of results. The queries below include “receive_date < systimestamp - interval '1' minute”; this is to avoid including in-flight instances. However, you may augment this to query for “stuck” messages up to a particular cut-off date, e.g. 01-August-2013.
Querying the DLV_MESSAGE table – Find the “unresolved” messages
As a reminder, the valid values for the STATE column.... BPMR_56 “Stuck” messages will be those with the values 0 or 4: “0 – STATE_UNRESOLVED” as we saw earlier in our example scenarios, and “4 – STATE_MAX_RECOVERED” which could occur if auto-recovery was set to on, or if someone had retried resubmitting the message a number of times from the Enterprise Manager Console. We have 2 types of messages – Invoke and Callback: Invoke → DLV_TYPE = 1, Callback → DLV_TYPE = 2. We can query the “stuck” messages for each type as follows.... Simple query on the DLV_MESSAGE table...
  • Group by the dlv_type, composite and component allowing us to isolate where the bulk of the “stuck” messages are.
  • Optionally separate the query into two parts, one for “Invoke”, one for “Callback”
BPMR_57 ...when we run this for our failed scenarios we get the results as expected.... BPMR_58
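The screenshot query (BPMR_57) can be approximated by the following sketch; the grouping columns (COMPOSITE_DN, COMPONENT_NAME) are assumptions about the PS6 DLV_MESSAGE layout:

```sql
-- Sketch: count "stuck" messages (state 0 = UNRESOLVED, 4 = MAX_RECOVERED)
-- per message type, composite and component, ignoring in-flight instances.
-- Grouping column names are assumptions and may differ by release.
SELECT DECODE(dlv_type, 1, 'INVOKE', 2, 'CALLBACK') AS message_type,
       composite_dn,
       component_name,
       COUNT(*) AS stuck_count
FROM   dlv_message
WHERE  state IN (0, 4)
AND    receive_date < systimestamp - INTERVAL '1' MINUTE
GROUP  BY dlv_type, composite_dn, component_name
ORDER  BY stuck_count DESC;
```

Splitting this into separate invoke (dlv_type = 1) and callback (dlv_type = 2) queries, as suggested above, is a simple matter of adding that predicate.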
Querying the DLV_MESSAGE table – for FYI messages
In production scenarios, there may be some rows that you can discard immediately – for example FYI tasks cause messages to be written to the DLV_MESSAGE table – these can essentially be ignored. You could use the following SQL to get the message_guid of such messages and then mark them as cancelled using the Recovery API. Use of the API will be discussed later; suffice it to say the example class is called BatchMessageCancelForFYI, included in the Java examples that accompany this document. You may also consider updating the human task (for FYI) as a workaround to avoid these extra messages. See Patch 16494888. BPMR_59
Querying the DLV_MESSAGE table – Drilling Deeper
We can now concentrate on the “stuck” messages and drill a little deeper to get some context, e.g. which activity caused the problem. To do this we can query further tables: COMPOSITE_INSTANCE, COMPONENT_INSTANCE, CUBE_INSTANCE, WORK_ITEM and BPM_AUDIT_QUERY.

Drill down to COMPOSITE_INSTANCE data... BPMR_60 ...when we run this for our failed scenarios we get the following... BPMR_61 ...i.e. we can see here that “Pattern 1” at 05:01 failed (COMPOSITE_STATE = 2) in the “BPMAsyncService” process and that “Pattern 2” at 05:56 failed in the “BPMTestTimeoutAsyncClient”, exactly as shown in the failure images above.

Drill down to CUBE_INSTANCE data... Here we can get information on cube state and scope size... BPMR_62 ...when we run this for our failed scenarios... BPMR_63 ...we can see that “Pattern 1” failed in “BPMAsyncService” and “Pattern 2” failed in “BPMTestTimeoutAsyncClient”, with a CUBE_INSTANCE_STATE of “10 – STATE_CLOSED_ROLLED_BACK”.

Drill down to WORK_ITEM data... We can now see to which activity we rolled back... BPMR_64 ...and for our failed scenarios... BPMR_65 ...this is interesting: we now only see the failure for “Pattern 1”, not for “Pattern 2”. Why? Well, remember “Pattern 2” rolled back all the way to the “Start” message of the client process, so no active WORK_ITEM rows exist. We can see that we reached “SCOPE_ID=TestAsyncAsync_try.2” and “NODE_ID=ACT8144282777463”; looking at the original process model... BPMR_66 ...we can see that the passivation point on the “Receive” activity created a WORK_ITEM entry with state “3 – OPEN_PENDING_COMPLETE”.

Drill down to BPM_AUDIT_QUERY data... If auditing was enabled, we can now see which was the last activity audited... BPMR_67 ...and for our failed scenarios... BPMR_68 ...we can see that for “Pattern 1” the last audited activity before rollback was “ScriptTask”, i.e. we know that it was here we had a failing data association, and for “Pattern 2” the last audited activity was “DBCall”, i.e. it was here that the process timed out.
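The drill-down queries above are shown only as images (BPMR_60 and following). As a hedged sketch of the first one, a query for failed composite instances might look like the following — the table and column names are assumptions based on the 11g SOAINFRA schema and should be verified against your version before use:

```sql
-- Failed composite instances (state 2 = faulted, as noted above), newest first.
-- Table/column names are assumptions; check them against your SOAINFRA schema.
SELECT ci.id,
       ci.composite_dn,
       ci.state AS composite_state,
       ci.created_time
FROM   composite_instance ci
WHERE  ci.state = 2
ORDER  BY ci.created_time DESC;
```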
Timer Queries
Expired Timers

In the case where a server has crashed, it can be very useful to know how many timers have expired during the downtime, given that on restart of the server they will all re-fire. We can query these as follows... BPMR_69 ...i.e. return all timers that had an expiry date in the past but are still “open_pending_complete” while the composite instance is still “running”.

Failed Timers

The other area where timers could be incomplete is our scenario 3: although the timer completed, the transaction which completed it has rolled back. We can query these as follows... BPMR_70 ...i.e. return all timers that are still “open_pending_complete” while the composite instance is “running with faults”. For our failed scenario 3 we can see the results of this query... BPMR_71
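The expired-timer query (BPMR_69) is also shown only as an image. A hedged sketch of its logic is below — the WORK_ITEM and CUBE_INSTANCE column names and state codes are assumptions based on the 11g SOAINFRA schema and should be checked against your installation:

```sql
-- Timers that expired in the past but are still open_pending_complete (state 3)
-- while their instance is still running. Names/state codes are assumptions.
SELECT wi.cikey,
       wi.node_id,
       wi.expiration_date
FROM   work_item wi,
       cube_instance ci
WHERE  wi.cikey = ci.cikey
AND    wi.state = 3                  -- open_pending_complete
AND    wi.expiration_date < SYSDATE
AND    ci.state = 1;                 -- running (verify the state codes for your version)
```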
Recovery Queries Conclusion
From the above queries it is possible to get a view of which instances have failed, which activity they had reached when the failure occurred, and to where they rolled back. With this information it is possible to determine whether recovery is possible from a business perspective (idempotency) and to infer patterns from failures in order to minimize recurrence.

Leveraging the Recovery API

Before running recovery, you may want to back up your SOA_INFRA database. Briefly, this is the Recovery API example that goes with this blog... BPMR_72 The previous sections described the SQL queries that find the messages in need of re-submission. Essentially the results of these queries (message_guid) will be fed into either recover invoke, recover callback, or cancel message operations. These APIs are in addition to what’s documented for fault recovery here. BPMR_73

Cancel FYI Task messages in the DLV_Message Table

Extract from “BatchMessageCancelForFYI”... BPMR_74 These cancelled messages will be picked up by the next SOA_INFRA purge assuming that a purge strategy is in place.

Batch Recovery of messages in the DLV_Message Table

Here we recover the message(s) using the “message_guid”, extract from “BatchMessageRecovery”.... BPMR_75

Refreshing Timers

Unlike the examples above, refreshing timers does not leverage the Recovery API. As previously mentioned, timers can be simply refreshed from the Enterprise Manager Console in PS6 and beyond, or with a simple API call to the “refreshAlarmTable” method on the BPMN Service Engine thus.... BPMR_76

Summary

In this four part blog we have taken a deep dive into how the BPM engine handles messages, threads, rollbacks & recovery. Whenever we hear from a customer "my message is stuck" or "I've lost one of my process instances" we should now know where to look and how to recover it. Attached to this blog is the JDev project with all SQL queries to find rolled back messages and all java code to recover them.... InstanceRecoveryExample  

BPM 11g Production Readiness Checklist


Introduction

With the help of the other members of the BPM A-Team (Sushil Shukla, John Featherly, Siming Mu) I have put together a concise list of points that should at least be visited prior to moving into production with BPM 11g. Note that it is BPM 11g specific: although it touches on areas common to other parts of SOA Suite (such as clustering), it does not in any way cover other products within SOA Suite (such as BPEL, Mediator etc.). The list is by no means exhaustive, just a collection of points that experience has shown need to be understood and validated; some may not be relevant in all customer projects. The items on the list are deliberately left without explanations - if an item means little or nothing to you, go and research it; there is a collection of links to useful documents at the end of this blog.

Environmental Checklist – High Level

□ Validate clustering & failover
□ Verify DR
□ Verify patching
□ Tuning
□ Validate load balancer, specifically the algorithm and the LB health check to soa-infra
□ Validate IDM, access paths, security, authentication and authorization
□ Research and test major component types and versions (JDK - HotSpot, JRockit, 6/7; browser - IE, Chrome, Firefox)

Testing Checklist

Functional Testing

□ All paths in all processes traversed
□ Empty XML elements - selection failures
□ Failure areas negatively tested
    □ Boundary events
    □ Event sub-processes
    □ Fault policies - retries, manual intervention etc.
    □ Business exceptions
    □ LDAP
□ Tests for non-BPM code: POJOs, stored procedures etc.
□ OBR functional test (if not under separate lifecycle management)
□ Validate dashboards, reports etc.

Non-Functional Testing

□ Load testing - average/peak
□ Stress testing - how far can the system be pushed
□ Negative testing under load
    □ Inbound & outbound failures to simulate increasing load
    □ Managed server failure / cluster failure
    □ Unavailable infrastructure components - network routers, switches, DB etc.
    □ Service unavailability - LDAP etc.
□ Engine & workspace performance when the DB is both sparsely & heavily populated

Operational Checklist

Database

□ DB partitioning implemented
□ Purging strategy understood and tested
□ Other standard database operations for backup, recovery testing etc.

BPM Technical Operations

□ Process instances queried, recovered & continued after failure, both individually and in bulk
    □ Invokes
    □ Callbacks
    □ Timers
□ Re-deployment / versioning of new processes
    □ Co-existence
    □ Instance patching
    □ Instance migration

BPM Business Operations

□ Query individual process instances based on business data - EM / Workspace / scripts etc.
□ Alter flow & alter data

Security

□ Auditing
□ Intrusion detection (PEN)

Configuration Management

□ Release management, UAT and promotion to production
□ Version control system
□ Environment spec and configuration definition

Resources

SOA 11g Database Growth Management Strategy Whitepaper
FMW 11g Performance & Tuning Guide
BPM 11g Performance Tuning
FMW 11g Enterprise Deployment Guide
FMW SOA/BPM Admin Guide - Partitioning
SOA 11g Database Performance

How to emulate 10g Adapter start/stop behaviour by manipulating the InboundThreadCount parameter


Introduction

In 10g, there was a mechanism for suppressing the consumption of messages at the Adapter level. That mechanism cannot be used in 11g. But there is a way...

Main Article

The way to do this is to set the InboundThreadCount in the appropriate MBean to zero. This will effectively suppress consumption of messages - e.g. from MQ or JMS. Setting this value to something greater than zero will cause consumption to resume. Making such a change is dynamic - i.e. no restart is required. This is easily handled through Enterprise Manager and can be done in either of two ways:

1. Navigate through the System MBean Browser and make the change there
2. Identify the composite, click through the appropriate item in the Services & References section, select the Properties tab, make the property change and click Apply

Either of these techniques is reasonable when there are very few composites to manage. However, now consider the situation when your installation has many Adapters across many Composites and, for whatever reason, you need to (effectively) stop or restart the Adapters. In this case, a programmatic approach will be less cumbersome and more efficient. What we need is a program that uses a control file that defines the Services in the context of the Composites that use them. Here's the definition of the control file:-
<?xml version="1.0"?>
<xs:schema version="1.0"
           xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified">
    <xs:element name="AdapterControl">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="compositeDetail" minOccurs="1" maxOccurs="unbounded">
                    <xs:complexType>
                        <xs:sequence>
                            <xs:element name="composite" type="xs:string"/>
                            <xs:element name="revision" type="xs:string"/>
                            <xs:element name="partition" type="xs:string"/>
                            <xs:element name="service" type="xs:string"/>
                            <xs:element name="location" type="xs:string"/>
                            <xs:element name="application" type="xs:string" default="soa-infra" minOccurs="0"/>
                            <xs:element name="threads" type="xs:positiveInteger" default="1"  minOccurs="0"/>
                        </xs:sequence>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>


So that's the schema definition. It's self-explanatory. Now let's look at an actual control file that manages just one Adapter (defined by its Service name) within a specific Composite. 

<?xml version="1.0" encoding="UTF-8"?>
<AdapterControl
    xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'>
    <compositeDetail>
        <composite>MQResponder</composite>
        <revision>1.1</revision>
        <partition>default</partition>
        <service>MQInbound</service>
        <location>AdminServer</location>
        <application>soa-infra</application>
        <threads>1</threads>
    </compositeDetail>
</AdapterControl>
All of the elements apart from <threads> are needed to identify the MBean that needs to be modified. The MBean object name is complex. Here's an example:-
oracle.soa.config:SCAComposite.SCAService=%SERVICE%,name=AdapterBinding,revision=%REVISION%,partition=%PARTITION%,SCAComposite=\"%COMPOSITE%\",Location=%LOCATION%,label=%LABEL%,j2eeType=SCAComposite.SCAService.SCABinding,Application=%APPLICATION%
Actually, that's not a real example. What you see here is an MBean object name with embedded tokens. The application that I'll present here replaces these tokens prior to performing the lookup. Note that the control XML does not contain the "label". This is derived from the other data provided in the control file. If you review the XSD at this point, you will note that the <compositeDetail> element may repeat. Furthermore, note that the <application> and <threads> elements are optional. The <threads> element is needed where users run with multiple adapter threads. If you only ever have one thread, then you don't need it. When stopping a service, this element is not used (because we're going to set the InboundThreadCount to zero).
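The token replacement the application performs can be illustrated with a short, self-contained sketch. The token values below come from the sample control file above; the label value is a placeholder, since in the real application the label is derived via the Fabric Locator at runtime:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ObjectNameBuilder {
    // Template as shown above; the %TOKENS% are replaced before the MBean lookup.
    // In the final ObjectName the composite name is quoted: SCAComposite="MQResponder".
    private static final String TEMPLATE =
        "oracle.soa.config:SCAComposite.SCAService=%SERVICE%,name=AdapterBinding,"
      + "revision=%REVISION%,partition=%PARTITION%,SCAComposite=\"%COMPOSITE%\","
      + "Location=%LOCATION%,label=%LABEL%,"
      + "j2eeType=SCAComposite.SCAService.SCABinding,Application=%APPLICATION%";

    public static String build(Map<String, String> tokens) {
        String name = TEMPLATE;
        for (Map.Entry<String, String> e : tokens.entrySet()) {
            name = name.replace("%" + e.getKey() + "%", e.getValue());
        }
        return name;
    }

    public static void main(String[] args) {
        Map<String, String> tokens = new LinkedHashMap<>();
        tokens.put("SERVICE", "MQInbound");
        tokens.put("REVISION", "1.1");
        tokens.put("PARTITION", "default");
        tokens.put("COMPOSITE", "MQResponder");
        tokens.put("LOCATION", "AdminServer");
        tokens.put("LABEL", "soa_label");   // placeholder; derived via the Fabric Locator at runtime
        tokens.put("APPLICATION", "soa-infra");
        System.out.println(build(tokens));
    }
}
```

The resulting string is then passed to `javax.management.ObjectName` for the MBean lookup.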

How it works

The application usage is as follows:-

StopStart stop|start host port username password control_file

More than likely, you'll run it something like this:-

java -jar AdapterControl.jar start localhost 7001 weblogic welcome1 "c:\users\aknight\my documents\AdapterController.xml"

The program does not validate the XML control file against the XSD at runtime. Bad things will happen if the XML structure is non-conformant, but don't worry too much as it can't damage your runtime system. Internally, the application uses the Fabric API (Locator) to work out what the %LABEL% needs to be. Other than that, it uses javax.management.MBeanServerConnection to find and invoke the appropriate method on the MBean. The application distribution is available here. The Java code is included in the JAR, and I have also included all dependent JARs in the distribution (so it's rather large). The application is offered as is; it is not a tool offered formally by Oracle.

A quick performance tuning hint for high speed Exalogic SOA performance

This is a very quick observation on a simple performance tuning fix for SOA Suite on Exalogic.

The problem

SOA Suite appears to grind to a halt when a load is imposed upon it when running on Exalogic. CPU may or may not spike on the SOA servers at this point. SOA may become completely unresponsive, or just very slow. You may see 504 gateway timeout errors, servers in doubt in the admin server screen, or other symptoms of a "barely responding" SOA Suite system.

The resolution

Turn ON "Always use keep-alive" for the origin server in Oracle Traffic Director (this defaults to OFF). Information on how to enable it at http://docs.oracle.com/cd/E23389_01/doc.11116/e21036/perf014.htm.
Note: this setting is hidden at the bottom of the advanced settings list for a 'route' on recent OTD installations.

The details

The recommended architecture for SOA Suite on Exalogic is to use Oracle Traffic Director (OTD) to route web service callouts between SOA composites. This allows for broad load distribution as well as some resiliency against failure. It also provides the capability to use InfiniBand-class connection speeds and latency between SOA composites, which is not a bad thing. Oracle Traffic Director, as detailed in the documentation above, defaults to NOT using HTTP keep-alive for PUT and POST requests to its origin servers. For SOA on Exalogic, the SOA Suite servers are the origin servers, and SOA Suite requests are 99% POST requests. This means that every request between composites (if you are using the recommended setup) will open a NEW HTTP connection at the remote SOA server. This causes a build-up of "stale" connections, which are slow to garbage collect for technical reasons. Eventually, with sufficient load, the pile-up will be so great that SOA ends up stuck in a garbage collection loop and will either throw OutOfMemoryError or slow to a crawl. By changing the Oracle Traffic Director setting to true, OTD will cache and reuse the same HTTP connection for multiple requests (it actually keeps a small pool of cached connections), the build-up won't happen, and performance will dramatically improve as a result.
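The effect of connection reuse can be seen with a small, self-contained JDK sketch — nothing OTD- or SOA-specific, and the class name is made up for illustration. It serves two HTTP requests and records how many distinct TCP connections the client used; with keep-alive in effect (and each response body fully drained), both requests typically share one connection, whereas one new connection per request is exactly the build-up described above:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class KeepAliveDemo {

    /** Serves two requests; returns {requestsHandled, distinctClientConnections}. */
    public static int[] run() throws Exception {
        Set<Integer> clientPorts = ConcurrentHashMap.newKeySet(); // one port per TCP connection
        AtomicInteger requests = new AtomicInteger();

        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/", exchange -> {
            clientPorts.add(exchange.getRemoteAddress().getPort());
            requests.incrementAndGet();
            byte[] body = "ok".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        try {
            int port = server.getAddress().getPort();
            for (int i = 0; i < 2; i++) {
                HttpURLConnection c = (HttpURLConnection)
                        new URL("http://localhost:" + port + "/").openConnection();
                try (InputStream in = c.getInputStream()) {
                    while (in.read() != -1) { /* drain the body so the socket can be reused */ }
                }
            }
        } finally {
            server.stop(0);
        }
        return new int[] { requests.get(), clientPorts.size() };
    }

    public static void main(String[] args) throws Exception {
        int[] r = run();
        // With keep-alive working, both requests usually arrive over one connection.
        System.out.println("requests=" + r[0] + " distinctConnections=" + r[1]);
    }
}
```

The Java HTTP client keeps connections alive by default, which is why draining the response body matters: an unread body prevents the socket from being returned to the reuse pool.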

BAM Design Considerations for Systems with High Volume of Transactions

Dealing with a high volume of transactions in Oracle Business Activity Monitoring (BAM) 11g can be challenging, and making the right design decisions is critical for the stability and performance of your system. This post covers the common issues and general design recommendations for Oracle BAM 11g in the context of a high volume of transactions.

The most common issue is that Active Data processing on the BAM server side may not keep up with the rate at which data is received by the BAM Active Data Cache (ADC), which in turn may cause slight or serious delays in refreshing reports with real-time data. When there is a huge backlog of unprocessed data in BAM, the performance and functionality of the whole system will be significantly impacted. Now, let's take a look at the following report design aspects, which are the key factors impacting scalability and performance, along with our recommendations.

Star Schema with High Volume of Data

It is common to use a star schema as the data model in BAM to store business data. Whenever a message is received by BAM, it is stored in the schema and triggers Active Data processing within the BAM server. Under a small or medium load, say less than 10-20 messages per second (the rate of incoming messages received by the BAM ADC), Active Data processing can keep up with the incoming data rate, and reports should work as expected. However, if the load goes higher, say over 100 messages per second, Active Data processing will consume more resources (e.g. CPU, memory, IO), and the chance of performance degradation will be higher as well. Thus, our main design consideration here is to find a way to shrink the size and data volume of the star schema.

Recommendations

1. If you have one single star schema that contains a lot of look-up and calculated fields, consider breaking the one big star schema into multiple smaller schemas.
The benefit of having multiple schemas is that the size and data volume of each schema is reduced, thereby reducing server-side contention and Active Data processing time.

2. If the transaction volume is high, say more than 60 transactions per second, consider adding additional pre-processing of the business data before sending it to BAM. For example, you can use Oracle Event Processing (OEP) to apply filters or aggregation to the business data before sending it to BAM. Using this approach, the data volume in BAM will be dramatically reduced.

3. Another approach for dealing with a high volume of data is to use Poll Mode rather than Push Mode for BAM dashboards. BAM dashboards in Poll Mode bypass Active Data processing and reload themselves at a given time interval. Under heavy load, this approach consumes fewer system resources (CPU, memory, threads, etc.) and therefore performs better in terms of throughput and response time. The charts below show the difference between Push and Poll Mode in terms of CPU usage and thread count.

CPU and Thread Count - Push Mode jfr03

CPU and Thread Count - Poll Mode jfr04

As you can see, when a report is running in Push Mode, CPU usage is consistently around 20%, and the system creates more threads for handling active data. From a performance perspective, a report in Poll Mode uses fewer resources and performs better under heavy load.

4. Consider normalizing your current data model to reduce the data volume of a single Data Object. If the main fact table (the main Data Object of the star schema) contains redundant fields that cause the data volume to grow dramatically, consider normalizing the data model by moving these fields to separate Data Objects or External Data Objects. If you need to drill down from the current view to show the values of these redundant fields, create a new report displaying these fields in Poll Mode and use the drill-across feature to link the original view to the new view.
Manipulating Data in Views

BAM reports can include data manipulation functions such as filters, calculations, drill down, drill across, driving, etc. Data manipulation is expensive in BAM and can seriously impact performance if overused in BAM reports under heavy load. Applying filters, drill down or drill across in a BAM report causes all report views to be reloaded. Reloading a report is expensive, as it requires querying the database, re-compiling the internal classes built for evaluating calculated fields, and establishing persistent HTTP connections between client and server. Frequent reloading will increase the chance of contention for server resources.

Recommendations

1. Minimize the use of data manipulation functions where possible. If these functions have to be used, ensure the underlying Data Object does not contain a high volume of data. In the context of a high transaction rate, consider normalizing Data Objects instead of using a single star schema, or use Poll Mode for dashboards.

2. If the drill-down or drill-across function is used in the report design, we recommend that you use separate Data Objects for the main and target views. Using separate Data Objects helps to reduce the data volume, which is the key factor for improving performance under heavy load.

Unable to start SOA-INFRA if the immediate and deferred audit policy “isActive” parameters are set to the same value


In PS3 (11.1.1.4) and PS4 (11.1.1.5), the SOA-INFRA application will not be able to start up when you set both the immediate and deferred audit policy MBean attributes to active. This is a known bug (13384305); there is a patch for PS5 (11.1.1.6) to resolve this issue, and there is also a cumulative patch (18254378) for PS4. If you need a quick workaround to start up the SOA-INFRA application before the patch is fully tested, this blog describes how to find the MBean configuration in the MDS schema and change the value in order to start up the SOA-INFRA application.

When you encountered this issue, the Weblogic console would display the server status as “RUNNING” but SOA-INFRA wouldn’t show up in EM Console. In the SOA server log file, you would see the following exception:

[/WEB-INF/fabric-config-core.xml]: Cannot resolve reference to bean
  'DOStoreFactory' while setting bean property 'DOStoreFactory'; nested
  exception is org.springframework.beans.factory.BeanCreationException: Error
  creating bean with name 'DOStoreFactory' defined in ServletContext resource
  [/WEB-INF/fabric-config-core.xml]: Invocation of init method failed; nested
  exception is java.lang.NullPointerException
  at
  org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolve
  Reference(BeanDefinitionValueResolver.java:275)
  at
  org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolve
  ValueIfNecessary(BeanDefinitionValueResolver.java:104)
  at
  org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.
  applyPropertyValues(AbstractAutowireCapableBeanFactory.java:1245)
  at
  org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.
  populateBean(AbstractAutowireCapableBeanFactory.java:1010)
  at
  org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.
  doCreateBean(AbstractAutowireCapableBeanFactory.java:472)

The audit policy attribute settings are stored in the MDS schema. There are 3 tables we can use to fix this problem: MDS_ATTRIBUTES, MDS_COMPONENTS and MDS_PARTITIONS. The MDS_COMPONENTS table stores the version information; the latest version has the highest value in the COMP_CONTENTID column.

select * from mds_components

UnableToStartSOA-1

The MDS_ATTRIBUTES stores the attribute values of the MBean configuration properties, in this case “audit-config”. As we are only interested in audit-config settings related to the SOA-INFRA application for this issue, we can find the correct partition id for the SOA-INFRA application in MDS_PARTITIONS table (see below):

select * from mds_partitions

UnableToStartSOA-2

 

 

 

To find the latest audit-config attribute values, run the following SQL statement to retrieve the latest audit policy configuration:

SELECT *
FROM MDS_ATTRIBUTES
WHERE ATT_CONTENTID=
(SELECT MAX(COMP_CONTENTID)
FROM MDS_COMPONENTS
WHERE COMP_LOCALNAME = 'audit-config'
)
AND MDS_ATTRIBUTES.ATT_PARTITION_ID=
(SELECT PARTITION_ID FROM MDS_PARTITIONS WHERE PARTITION_NAME='soa-infra'
);

The default configuration is shown below:

UnableToStartSOA-3
To resolve this issue as a workaround, you need to ensure that one of the “isActive” attribute values (either immediate or deferred) is set to “true” and the other is set to “false”; you will then be able to start the SOA-INFRA application.

 

 


Resequencer Health Check


11g Resequencer Health Check

In this Blog we will see a few useful queries to monitor and diagnose the health of resequencer components running in a typical SOA/AIA Environment.

The first query is a snapshot of the current count of Resequencer messages in their various states and group_statuses.

Query1: Check current health of resequencers

select to_char(sysdate,'YYYY-MM-DD HH24:MI:SS') time, gs.status group_status, m.status msg_status, count(1), gs.component_dn 
from mediator_group_status gs, mediator_resequencer_message m
where m.group_id = gs.group_id
and gs.status <= 3
and gs.component_status!=1
group by gs.status, m.status, gs.component_dn
order by gs.component_dn, gs.status;

Table below lists a representative sample output of the above query from a running SOA Environment containing Resequencers collected at 12:04:50

Query 1 sample output

For our analysis, let us collect the same data again after a few seconds

2

Refer to the appendix  for a quick glossary of Resequencer group_status and message_status state values

Let us dive a bit deeper into each of the above state combinations, their counts and what they imply.

1. GRP_STATUS/MSG_STATUS = 0/0 – READY

These show the messages which are ready for processing and eligible to be locked and processed by the resequencer. For a healthy system this number should be quite low, as messages are locked and processed continuously by the resequencer. When messages stop arriving into the system, this count should drop to zero.

A high count for this combination would suggest that not enough groups are being locked by the resequencer for the rate at which messages are arriving for processing.  The Mediator property – “Resequencer Maximum Groups Locked” should be adequately increased to lock groups at a higher rate.

Refer here to see how this property can be changed from EM Console

2. GRP_STATUS=0/MSG_STATUS=2 – PROCESSED

This count indicates the number of processed messages and will grow over time. A very high count (like > 1 million in the above example) indicates that a resequencer purge is due and should be run soon to delete the processed messages.

 

3. GRP_STATUS=0/MSG_STATUS=5 – ABORTED

This count shows the number of messages that have been manually aborted by the administrator. Refer here for how resequencer messages can be aborted using the SOA EM Console.

4. GRP_STATUS=1/MSG_STATUS=0 – LOCKED

This combination of states shows the messages within groups which are currently being processed. For a healthy system this number should be quite low, as the messages belonging to locked groups are processed continuously by the resequencer worker threads. When messages stop arriving into the system, this count should drop to zero.

A high count for this combination would suggest that not enough worker threads are available to process the messages for the rate at which groups are locked for processing.  The Mediator property – “Resequencer Worker Threads” should be adequately increased to boost the message processing rate.

Refer here to see how this property can be changed from EM Console

 

5. GRP_STATUS=1/MSG_STATUS=2 – LOCKED

The count for this combination shows the number of messages which have been processed for groups that are still locked. This is a transient state: once all messages for the locked groups are processed, these counts move to GRP_STATUS=0/MSG_STATUS=2.

 

6. GRP_STATUS=3 – ERRORED

These show the messages belonging to errored groups. They indicate messages which have failed processing due to various errors and will need to be manually recovered from the EM Console or the AIA resubmission tool. If these messages are recovered and processed successfully, they transition to GRP_STATUS=0/MSG_STATUS=2. If the errors are non-recoverable, the messages can be aborted from the EM Console, and they move to GRP_STATUS=0/MSG_STATUS=5.

Refer to my earlier blog here for details on recovery of resequencer errors.

 

Query2: Check ContainerID’s  health

select * from MEDIATOR_CONTAINERID_LEASE ;

Table below shows a sample output for the above query from a 2 node clustered SOA installation.

3

 

 

It shows the time when each node last renewed its mediator container ID. These container ID renewals serve as heartbeats for the mediator resequencer. They are vital for maintaining the load balance of messages among the nodes and for the failover of groups/messages that were allocated to expired nodes.


Query3: Load Balance between cluster nodes

select to_char(sysdate,'YYYY-MM-DD HH24:MI:SS') time, gs.container_id container, gs.status group_status, m.status msg_status, count(1)
from mediator_group_status gs, mediator_resequencer_message m
where m.group_id = gs.group_id
and   gs.status  in (0,1)
and component_status!=1 
group by  gs.container_id, gs.status, m.status
order by gs.container_id, gs.status;

The above query can be used to monitor the load balance of messages between nodes of a cluster. Sample output below shows an output for a 2 node clustered SOA environment.

4

This sample output shows that the ready and locked messages are roughly evenly distributed across the cluster. If major skew is observed for a specific container, further analysis may be required; thread dumps and diagnostic logs of the slower node may point to the cause of the skew.

 

Appendix:

The table below lists the important status values of the MEDIATOR_GROUP_STATUS and MEDIATOR_RESEQUENCER_MESSAGE tables and how the values can be interpreted.

6 5

White Paper on Message Sequencing Patterns using Oracle Mediator Resequencer


One of the consequences of asynchronous SOA-based integration patterns is that messages are not guaranteed to reach their destination in the same sequence as they were initiated at the source.

Ever faced an integration scenario where

- an update order is processed in the integration layer before the create order?

- the target system cannot process two orders for the same customer?

Common fixes used in the field include

- Singleton BPEL implementations, singleton JCA adapters, custom sequencing logic using tables etc.

These common ‘fixes’ often result in performance bottlenecks, since all messages are usually funneled through a single-threaded component. These approaches also become unreliable and counter-productive when used in clustered deployments, and error scenarios can cause unexpected behavior.

To address the sequencing requirement without these shortcomings, Oracle SOA Suite provides the Mediator Resequencer component that allows you to build/rebuild a sequence from an out-of-sequence set of input messages. The Resequencer enforces sequential processing of related messages and performs parallel processing of unrelated messages, thereby keeping up the performance levels.

The white paper below aims to provide a common set of use cases for using a Resequencer, Resequencer modes, best practices, configurations, handling error scenarios, HA, failover, etc.

Oracle Mediator Resequencer.pdf

How to Recover Initial Messages (Payload) from SOA Audit for Mediator and BPEL components


Introduction

In Fusion Applications, the status of a SOA composite instance is either running, completed, faulted or staled. Composite instances become staled immediately (irrespective of their current status) when the respective composite is redeployed with the same version. The messages (payload) are stored in SOA audit tables until they are purged. Users can go through Enterprise Manager and view the audit trail and messages of each composite instance, which is good for debugging. However, there are situations where you want to re-submit the initiation of SOA composite instances in bulk, for the following reasons:

  • The composite was redeployed with the same version number that resulted in all respective instances (completed successfully, faulted or in-flight) becoming stale (“Staled” status)
  • Instances failed because down-stream applications failed and the respective composite did not have an ability to capture the initial message in persistence storage to retry later

In these cases, it may be necessary to capture the initial message (payload) of many instances in bulk to resubmit them. This can be managed programmatically through the SOA Facade API. The Facade API is part of Oracle SOA Suite’s Infrastructure Management Java API and exposes operations and attributes of composites, components, services, references and so on. As long as instances have not been purged, a developer can leverage the SOA Facade API to retrieve the initial messages of either Mediator or BPEL components programmatically. The captured messages can be either resubmitted immediately or stored in persistent storage, such as a file, JMS or a database, for later submission. There are several samples available, but this post takes the approach of creating a SOA composite that provides the ability to retrieve the initial message of Mediator or BPEL components. The sample provides the framework, and you can tailor it to your requirements.

Main Article

SOA Facade API

Please refer to this for the complete SOA Facade API documentation. Internally, SOA audit trails and messages work as follows:

  • The “Audit Level” must be either Production or Development to capture the initial payload
  • The “Audit Trail Threshold” determines where the initial payload is stored. If the threshold is exceeded, a View XML link is shown in the audit trail instead of the payload, and the large payload is stored in a separate database table: audit_details. The default threshold is 50,000 bytes

Please refer to the following document for more details on these properties.

Since the SOA composite we are developing will be deployed on the same SOA server, you do not need user credentials to create the Locator object. This is all you need:

Locator locator = LocatorFactory.createLocator();

Please see the SOA Facade API documentation for more information on the Locator class.

Once the Locator object is created, you can look up composites and apply various filters to narrow the search down to specific components. This is all explained in detail, with examples, in the SOA Facade documentation. Here, we focus on how to retrieve the initial messages of Mediator and BPEL components in order to resubmit them.

How to retrieve initial payload from BPEL?

In BPEL, the initial payload is either embedded in the audit trail or linked from it; this is controlled by the audit trail threshold value. If the payload size exceeds the threshold, the audit trail contains a link instead of the payload. This is the main method to get the audit trail:

auditTrailXml = (String)compInst.getAuditTrail();

/* The "compInst" is a ComponentInstance obtained as follows: */
Component lookupComponent = (Component)locator.lookupComponent(componentName);
ComponentInstanceFilter compInstFilter = new ComponentInstanceFilter();
compInstFilter.setId(componentId);
List<ComponentInstance> compInstances = lookupComponent.getInstances(compInstFilter);


If the payload size exceeds the audit threshold value, the actual payload is stored as a separate XML document in the “audit_details” table. The following Facade call retrieves it:

auditDetailXml = (String)locator.executeComponentInstanceMethod(componentType +":"+ componentId, auditMethod, new String[]{auditId});

The “auditId” for BPEL is always “0”.


How to retrieve initial payload from Mediator?

The initial payload in Mediator is never embedded in the audit trail. It is always linked, and the retrieval syntax is similar to BPEL’s (when the payload size exceeds the audit threshold value). However, the “auditId” must first be parsed out of the Mediator audit trail. This is the code snippet to get the “auditId” from the Mediator audit trail:

if (componentType.equals("mediator")) {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = dbf.newDocumentBuilder();
    Document document = db.parse(new InputSource(new StringReader(auditTrailXml)));
    NodeList nodeList = document.getElementsByTagName("event");
    String attribute = nodeList.item(0).getAttributes().getNamedItem("auditId").getNodeValue();
    addAuditTrailEntry("The Audit is: " + attribute);
    auditId = attribute;
    auditMethod = "getAuditMessage";
}

/* Once you have the "auditId" from the code above, the syntax to get the initial payload is the same as in BPEL. */
auditDetailXml = (String)locator.executeComponentInstanceMethod(componentType +":"+ componentId, auditMethod, new String[]{auditId});
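The parsing logic above can be exercised standalone against a mock audit-trail fragment, which is handy for testing outside the engine. This is an illustrative sketch (the class name, helper method, and sample XML are mine, not from the sample project); only the <event>/auditId extraction mirrors the snippet above:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AuditIdParseDemo {

    // Extracts the auditId attribute of the first <event> element,
    // exactly as the Mediator snippet above does.
    static String extractAuditId(String auditTrailXml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = db.parse(new InputSource(new StringReader(auditTrailXml)));
        NodeList nodeList = document.getElementsByTagName("event");
        return nodeList.item(0).getAttributes().getNamedItem("auditId").getNodeValue();
    }

    public static void main(String[] args) throws Exception {
        // A fabricated audit trail fragment, only for exercising the parser.
        String mock = "<audit-trail><event auditId=\"42\"/></audit-trail>";
        System.out.println(extractAuditId(mock)); // prints 42
    }
}
```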


Complete Java embedded code in BPEL

try { 
String componentInstanceID = new Long(getInstanceId()).toString();    
addAuditTrailEntry("This Run time Component Instance ID is "+componentInstanceID);  

XMLElement compositeNameVar = (XMLElement) getVariableData("inputVariable", "payload", "/client:process/client:compositeName");
String compositeName = compositeNameVar.getTextContent();  

XMLElement compositeIdVar = (XMLElement) getVariableData("inputVariable", "payload", "/client:process/client:compositeId");
String compositeId = compositeIdVar.getTextContent();  

XMLElement componentTypeVar = (XMLElement) getVariableData("inputVariable", "payload", "/client:process/client:componentType");
String componentType = componentTypeVar.getTextContent();  

XMLElement componentNameVar = (XMLElement) getVariableData("inputVariable", "payload", "/client:process/client:componentName");
String componentName = componentNameVar.getTextContent();  

XMLElement componentIdVar = (XMLElement) getVariableData("inputVariable", "payload", "/client:process/client:componentId");
String componentId = componentIdVar.getTextContent();  

String auditDetailXml = "null";
String auditTrailXml = "null";
String auditMethod = "getAuditDetails";
String auditId = "0";

addAuditTrailEntry("The lookup Composite Instance Name is "+compositeName);  
addAuditTrailEntry("The lookup Composite Instance ID is "+compositeId);  
addAuditTrailEntry("The lookup Component Instance Name is "+componentName);
addAuditTrailEntry("The lookup Component Instance Type is " + componentType);
addAuditTrailEntry("The lookup Component Instance ID is "+componentId);  

Locator locator = LocatorFactory.createLocator();  
Composite composite = (Composite)locator.lookupComposite(compositeName);  
Component lookupComponent = (Component)locator.lookupComponent(componentName);  

ComponentInstanceFilter compInstFilter = new ComponentInstanceFilter();  

compInstFilter.setId(componentId);

List<ComponentInstance> compInstances = lookupComponent.getInstances(compInstFilter);  
if (compInstances != null) {  
    addAuditTrailEntry("====Audit Trail of Instance===");  
    for (ComponentInstance compInst : compInstances) {  
        String compositeInstanceId = compInst.getCompositeInstanceId(); 
        String componentStatus = compInst.getStatus(); 
        addAuditTrailEntry("Composite Instance ID is "+compositeInstanceId);  
        addAuditTrailEntry("Component Status is "+componentStatus);  

        addAuditTrailEntry("Get Audit Trail");
        auditTrailXml = (String)compInst.getAuditTrail();

        if (componentType.equals("mediator")) {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document document = db.parse(new InputSource(new StringReader(auditTrailXml)));
            NodeList nodeList = document.getElementsByTagName("event");
            String attribute = nodeList.item(0).getAttributes().getNamedItem("auditId").getNodeValue();
            addAuditTrailEntry("The Audit is: " + attribute);

            auditId = attribute;
            auditMethod="getAuditMessage";
            }

        addAuditTrailEntry("Received Audit Trail");

        addAuditTrailEntry("Get Audit Details of: "+ componentType +":"+ componentId + " for auditId: " + auditId);

        try {
            auditDetailXml = (String)locator.executeComponentInstanceMethod(componentType +":"+ componentId, auditMethod, new String[]{auditId});
        } catch (Exception e) { 
        addAuditTrailEntry("Exception in getting audit details:" + e);
        }

        addAuditTrailEntry("Received Audit Details");

        setVariableData("auditTrailString", "payload", "/client:AuditTrailString/client:auditTrail", auditTrailXml);
        setVariableData("auditDetailString", "payload", "/client:AuditDetailString/client:auditDetail", auditDetailXml);

        addAuditTrailEntry("BPEL Variables set");
    }  
} 

} catch (Exception e) { 
    addAuditTrailEntry("Exception in getting Audit Trails and Details"); 
}

The schema of the payload used to run the above composite is:

    <element name="process">
        <complexType>
            <sequence>
                <element name="compositeName" type="string"/>
                <element name="compositeId" type="string"/>
                <element name="componentType" type="string"/>
                <element name="componentName" type="string"/>
                <element name="componentId" type="string"/>
            </sequence>
        </complexType>
    </element>
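For illustration, a request payload conforming to this schema might look like the following. This is a hand-written example: the namespace URI and the name/ID values are placeholders, not values from the sample project:

```xml
<process xmlns="http://example.com/soaaudit/client">
    <compositeName>DummyComposite</compositeName>
    <compositeId>10010</compositeId>
    <componentType>mediator</componentType>
    <componentName>Mediator1</componentName>
    <componentId>10011</componentId>
</process>
```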

Sample Code

Please get the complete JDeveloper project as follows:

1. DummySOAApplication to retrieve the initial payload of Mediator and BPEL components

2. The SOA audit trail composite “SOAAuditTrails”, which contains the logic to get the initial payload of the “Dummy Composite”

3. Sample payload “SOA_audit_payload”


Mediator Parallel Routing Rules


Introduction

In 11g, Mediator executes routing rules either sequentially or in parallel. If you plan to use parallel routing rules, you need to understand how Mediator queues and evaluates routings in parallel on different threads. This article describes the two kinds of threads used for parallel routing rules and the design considerations to take into account if you plan to use parallel routing rules in your implementation.

mediator-parallel-routing-rules-1

Main Article

By using parallel routing rules, services can be designed to be truly asynchronous. However, the service engine executes these requests in a rather unique fashion. Let’s say that the average time taken to complete execution of each service is one second. If you were to receive 10 requests on the first Mediator service and the Mediator service engine is configured with 10 threads, the expectation is that all requests would complete within one second. However, that is not the case. Let me elaborate further.

Let’s say you have 3 Mediator services deployed to your SOA server, each with a single parallel routing rule. When a Mediator service receives a message, the message is inserted into the Mediator store by the dispatcher: its metadata is written to the MEDIATOR_DEFERRED_MESSAGE table, and its payload goes into the MEDIATOR_PAYLOAD table. All of this occurs on the original calling thread.

The Mediator service engine has a single thread called the locker thread. The locker thread surfaces message metadata from the MEDIATOR_DEFERRED_MESSAGE table into an internal in-memory queue, and it does this in its own transaction. The MEDIATOR_DEFERRED_MESSAGE table also records a message state, listed below:
0 – READY
1 – LOCKED
2 – COMPLETED SUCCESSFULLY
3 – FAULTED

Hence, it is important to understand how the locker thread works behind the scenes, as this will affect your design decisions:

The locker thread uses an algorithm to process the messages stored by the dispatcher, and there is only one locker thread per managed server. After the dispatcher has stored the message data in the MEDIATOR_DEFERRED_MESSAGE table, the locker thread locks messages that have state “0”; the number of messages locked at a time depends on the Parallel Maximum Rows Retrieved parameter set in EM->SOA-INFRA->SOA Administration->Mediator Properties. The locker thread cycles through one Mediator component at a time and checks whether there are any requests to process from the internal queue. It processes a component’s messages by changing their state to “1”, and then sleeps for the interval configured in the “parallel locker thread sleep” setting before moving on to the next Mediator component. If it finds no messages to process, it moves on to the next, and the next, until it loops back to the first component, where it then processes that component’s next requests.
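The priority-weighted locking order can be modeled with a small standalone sketch. This is a toy model of my own, not engine code: it only illustrates that a component with priority p is locked p times per locker cycle:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the locker thread's priority-weighted cycle (not engine code).
public class LockerCycleSketch {

    // In pass r the locker visits every component whose priority exceeds r,
    // so a component of priority p is locked p times per full cycle.
    static List<String> lockerCycle(String[] components, int[] priorities) {
        int maxPriority = 0;
        for (int p : priorities) {
            maxPriority = Math.max(maxPriority, p);
        }
        List<String> visits = new ArrayList<>();
        for (int pass = 0; pass < maxPriority; pass++) {
            for (int i = 0; i < components.length; i++) {
                if (priorities[i] > pass) {
                    visits.add(components[i]); // lock, then sleep, in the real engine
                }
            }
        }
        return visits;
    }

    public static void main(String[] args) {
        // Three components m1, m2, m3 with priorities 1, 2, 3.
        System.out.println(lockerCycle(
                new String[] {"m1", "m2", "m3"}, new int[] {1, 2, 3}));
        // prints [m1, m2, m3, m2, m3, m3]
    }
}
```

Six lock passes (each followed by a sleep) for just three components: this is why deployments with many parallel components, or low-priority ones, see long locker cycles.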

mediator-parallel-routing-rules-2

For example, if there are 3 Mediator components m1, m2, m3 with priorities 1, 2, 3, the algorithm goes: lock m1 -> sleep -> m2 -> sleep -> m3 -> sleep -> m2 -> sleep -> m3 -> sleep -> m3. In these 6 iterations of the locker thread, m1’s messages are locked once, m2’s twice, and m3’s three times, in line with their priorities. All of this happens in a single locker cycle, and only after the locker thread has locked the retrieved messages are they queued for the worker threads to process. So if you have many Mediator components with parallel routing rules (say, 50), it will take the locker thread a considerable amount of time to complete one locker cycle, and components with lower priority will wait longer to have their messages locked. The locker cycle is reset whenever you deploy or undeploy a Mediator component with parallel routing rules; this ensures that components with higher priority are processed in the next cycle. You can observe these behaviors by setting the following loggers to the Trace:32 FINEST level in Oracle Enterprise Manager Fusion Middleware Control:

• oracle.soa.mediator.common
• oracle.soa.mediator.service
• oracle.soa.mediator.dispatch
• oracle.soa.mediator.serviceEngine

Example of Trace:32 diagnostic log:

[2013-12-02T16:06:21.696-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common.listener] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.listener.DBLocker] [APP: soa-infra] [SRC_METHOD: run] Locker running
[2013-12-02T16:06:21.697-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common.listener] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.listener.DBLocker] [APP: soa-infra] [SRC_METHOD: lockMessages] Trying to obtain locks
[2013-12-02T16:06:21.697-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.JTAHelper] [APP: soa-infra] [SRC_METHOD: beginTransaction] Transaction  begins
[2013-12-02T16:06:21.697-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.JTAHelper] [APP: soa-infra] [SRC_METHOD: getTransactionStatus] TransactionManager status
[2013-12-02T16:06:21.697-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.JTAHelper] [APP: soa-infra] [SRC_METHOD: getTransactionStatus] Getting Transaction status
[2013-12-02T16:06:21.697-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.JTAHelper] [APP: soa-infra] [SRC_METHOD: beginTransaction] TransactionManager begin
[2013-12-02T16:06:21.697-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.JTAHelper] [APP: soa-infra] [SRC_METHOD: getTransactionStatus] TransactionManager status
[2013-12-02T16:06:21.697-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.JTAHelper] [APP: soa-infra] [SRC_METHOD: getTransactionStatus] Getting Transaction status
[2013-12-02T16:06:21.697-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.dispatch.db] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.dispatch.db.DeferredDBLocker] [APP: soa-infra] [SRC_METHOD: lock] Obtaining locks for max rows 200
[2013-12-02T16:06:21.697-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.dispatch.db] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.dispatch.db.DeferredDBLocker] [APP: soa-infra] [SRC_METHOD: lock] Removing ABCS/MediatorTest!1.0*soa_e302c3ae-9b29-4bd5-802a-6052389f7be3/Mediator1 from counter 
[2013-12-02T16:06:21.697-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.dispatch.db] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.dispatch.db.DeferredDBLocker] [APP: soa-infra] [SRC_METHOD: lock] Obtaining locks for ABCS/MediatorTest!1.0/Mediator1 and counter 1 at index 0
[2013-12-02T16:06:21.703-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common.listener] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.listener.DBLocker] [APP: soa-infra] [SRC_METHOD: lockMessages] Obtained locks
[2013-12-02T16:06:21.703-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.JTAHelper] [APP: soa-infra] [SRC_METHOD: commitTransaction] Commiting Transaction
[2013-12-02T16:06:21.703-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.JTAHelper] [APP: soa-infra] [SRC_METHOD: getTransactionStatus] TransactionManager status
[2013-12-02T16:06:21.704-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.JTAHelper] [APP: soa-infra] [SRC_METHOD: getTransactionStatus] Getting Transaction status
[2013-12-02T16:06:21.704-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.JTAHelper] [APP: soa-infra] [SRC_METHOD: commitTransaction] TransactionManager commit
[2013-12-02T16:06:21.713-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common.listener] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.listener.DBLocker] [APP: soa-infra] [SRC_METHOD: enqueueLockedMessages] Spining ...........Sleeping for 1000 milliseconds oracle.tip.mediator.dispatch.db.DeferredDBLocker
[2013-12-02T16:06:22.716-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common.listener] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.listener.DBLocker] [APP: soa-infra] [SRC_METHOD: run] Locker running
[2013-12-02T16:06:22.717-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common.listener] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.listener.DBLocker] [APP: soa-infra] [SRC_METHOD: lockMessages] Trying to obtain locks
[2013-12-02T16:06:22.717-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.JTAHelper] [APP: soa-infra] [SRC_METHOD: beginTransaction] Transaction  begins
[2013-12-02T16:06:22.717-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.JTAHelper] [APP: soa-infra] [SRC_METHOD: getTransactionStatus] TransactionManager status
[2013-12-02T16:06:22.717-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.JTAHelper] [APP: soa-infra] [SRC_METHOD: getTransactionStatus] Getting Transaction status
[2013-12-02T16:06:22.717-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.JTAHelper] [APP: soa-infra] [SRC_METHOD: beginTransaction] TransactionManager begin
[2013-12-02T16:06:22.717-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.JTAHelper] [APP: soa-infra] [SRC_METHOD: getTransactionStatus] TransactionManager status
[2013-12-02T16:06:22.717-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.JTAHelper] [APP: soa-infra] [SRC_METHOD: getTransactionStatus] Getting Transaction status
[2013-12-02T16:06:22.717-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.dispatch.db] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.dispatch.db.DeferredDBLocker] [APP: soa-infra] [SRC_METHOD: lock] Obtaining locks for max rows 200
[2013-12-02T16:06:22.717-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.dispatch.db] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.dispatch.db.DeferredDBLocker] [APP: soa-infra] [SRC_METHOD: lock] Removing ABCS/MediatorTest!1.0*soa_b0b789b9-0114-4138-ac50-9331e044af38/Mediator2 from counter 
[2013-12-02T16:06:22.717-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.dispatch.db] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.dispatch.db.DeferredDBLocker] [APP: soa-infra] [SRC_METHOD: lock] Obtaining locks for ABCS/MediatorTest!1.0/Mediator2 and counter 1 at index 0
[2013-12-02T16:06:22.724-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common.listener] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.listener.DBLocker] [APP: soa-infra] [SRC_METHOD: lockMessages] Obtained locks
[2013-12-02T16:06:22.724-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.JTAHelper] [APP: soa-infra] [SRC_METHOD: commitTransaction] Commiting Transaction
[2013-12-02T16:06:22.724-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.JTAHelper] [APP: soa-infra] [SRC_METHOD: getTransactionStatus] TransactionManager status
[2013-12-02T16:06:22.724-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.JTAHelper] [APP: soa-infra] [SRC_METHOD: getTransactionStatus] Getting Transaction status
[2013-12-02T16:06:22.724-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.JTAHelper] [APP: soa-infra] [SRC_METHOD: commitTransaction] TransactionManager commit
[2013-12-02T16:06:22.735-07:00] [WLS_SOA1] [TRACE:32] [] [oracle.soa.mediator.common.listener] [tid: Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms\n] [userId: <anonymous>] [ecid: a88161838353ad87:-406b0e26:142b4bfd15c:-8000-0000000000000004,0] [SRC_CLASS: oracle.tip.mediator.common.listener.DBLocker] [APP: soa-infra] [SRC_METHOD: enqueueLockedMessages] Spining ...........Sleeping for 1000 milliseconds oracle.tip.mediator.dispatch.db.DeferredDBLocker


After the locker thread has locked the messages, the worker threads retrieve them from the in-memory queue and process them. The number of worker threads can be tuned by changing the Parallel Worker Threads property in EM->SOA-INFRA->SOA Administration->Mediator Properties. Once a message is processed, the worker thread changes its state to either “2” (completed successfully) or “3” (faulted).

The engine was designed this way to prevent thread starvation caused by load on a single composite. What the engine wants to avoid is this: if one Mediator service has received hundreds of thousands of requests and another has received two, each service should still be given a fair amount of processing time; otherwise the two requests might have to wait for hours to execute. Thus, the three settings to consider for asynchronous Mediator services are the following:

  • The Parallel Locker Thread Sleep setting: defined at the Mediator service engine level
  • The number of threads allocated to the Mediator service engine: defined by the Parallel Worker Threads parameter
  • The Priority property: set at design time and applicable only to parallel routing rules

Another important point to note: when a Mediator service engine starts, it registers itself in a database table called MEDIATOR_CONTAINERID_LEASE and gets a container ID. This is important because when a row is inserted into the MEDIATOR_DEFERRED_MESSAGE table, the engine round-robins the deferred message across its containers and assigns the ID of the container that should process the message.

Hence, I strongly recommend that you take the design considerations listed below into account when designing composites that use Mediator parallel routing rules:

  • The priority property applies only to parallel routing rules, so choose each Mediator component’s priority based on your business requirements.
  • The locker thread cycles through all Mediator components with parallel routing rules deployed in your environment, regardless of whether they have been retired or shut down.
  • Use sequential routing rules if latency is important and you expect messages to be processed without delay.
  • If you have well over 100 parallel Mediator components deployed in your environment, the time to complete a locker cycle grows considerably and cannot be tuned away, because there is only one locker thread and the lowest parallel locker thread sleep you can set is 1 second.
  • If a Mediator component contains both sequential and parallel routing rules, the sequential routing rules are executed before the parallel ones.
  • Fault policies apply to parallel routing rules only. For sequential routing rules, the fault goes back to the caller, and it is the caller’s responsibility to handle it. If the caller is an adapter, you can define rejection handlers on the inbound adapter to take care of the errored-out messages, that is, the rejected messages.

Improve SSL Support for Your WebLogic Domains


Introduction

Every WebLogic Server installation comes with SSL support. But for some reason many installations get this interesting error message at startup:

Ignoring the trusted CA certificate “CN=Entrust Root Certification Authority – G2,OU=(c) 2009 Entrust, Inc. – for authorized use only,OU=See www.entrust.net/legal-terms,O=Entrust, Inc.,C=US”. The loading of the trusted certificate list raised a certificate parsing exception PKIX: Unsupported OID in the AlgorithmIdentifier object: 1.2.840.113549.1.1.11.

This looks odd, and many people ignore these error messages. However, if your strategy is to log only real, actionable errors, you will quickly find yourself looking for a solution. The Internet is full of possible solutions: some recommend removing the certificates from the JDK trust store, some recommend using a different trust store. But is this the best solution, and what are the side effects?

Main Article

Our way to the solution starts by understanding the error message. Here it is again.

Ignoring the trusted CA certificate “CN=Entrust Root Certification Authority – G2,OU=(c) 2009 Entrust, Inc. – for authorized use only,OU=See www.entrust.net/legal-terms,O=Entrust, Inc.,C=US”. The loading of the trusted certificate list raised a certificate parsing exception PKIX: Unsupported OID in the AlgorithmIdentifier object: 1.2.840.113549.1.1.11.

The first sentence is the result while the second sentence explains the reason. Looking at the reason, we quickly find the “certificate parsing exception“. But what does “PKIX: Unsupported OID in the AlgorithmIdentifier object: 1.2.840.113549.1.1.11” tell us?

  • PKIX stands for the Public Key Infrastructure (X.509). X.509 is the standard used to export, exchange, and import SSL certificates.
  • OID stands for the Object Identifier. Object Identifiers are globally unique and organized in a hierarchy. This hierarchy is maintained by the standards bodies in every country. Every standards body is responsible for a specific branch and can define and assign entries into the hierarchy.

With this background information we can look up the number 1.2.840.113549.1.1.11 in the OID Repository (see References for the link) and get this result: “iso(1) member-body(2) us(840) rsadsi(113549) pkcs(1) pkcs-1(1) sha256WithRSAEncryption(11)”.

Combining the certificate information in the first sentence and the information from the OID lookup we have the following result:

The certificate for CN=Entrust Root Certification Authority - G2,OU=(c) 2009 Entrust, Inc. - for authorized use only,OU=See www.entrust.net/legal-terms,O=Entrust, Inc.,C=US is signed with sha256WithRSAEncryption, which the configured SSL implementation does not support!

You will probably see more messages for similar or different encryption algorithms used in other certificates.

The Root Cause

These factors cause this (and similar) error messages:

  • By default, the Java Cryptography Extension (JCE) that comes with the JDK ships with only the limited strength jurisdiction policy files.
  • The default trust store of the JDK that holds this and other certificates can be found in JAVA_HOME/jre/lib/security/cacerts.
  • WebLogic Server versions before 12c come with the Certicom JSSE implementation. The Certicom implementation will not be updated, because current JDKs already come with the standard SunJSSE implementation.

The Problem

The Certicom implementation works perfectly with many SSL certificates but does not support newer and stronger algorithms. Removing certificates from the default trust store, or switching to a new trust store, works only if you do not need to install third-party certificates, for example from well-known Certificate Authorities.

The Solution

To remove these error messages and support newer SSL certificates we have to do these steps:

  • Upgrade the jurisdiction policy files with the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files. You can download the files that match your JDK version from the Oracle Technology Network (see References). Follow the installation instructions that come with the distribution.
  • Enable SunJSSE Support in WebLogic Server
    • Log in to the WebLogic console
    • Go to [Select your Server] -> SSL -> Advanced
    • Set "Enable JSSE" to true.
  • Restart your domain completely (including NodeManager)
    • If you start your domains with a WLST script:

      CONFIG_JVM_ARGS='-Dweblogic.ssl.JSSEEnabled=true -Dweblogic.security.SSL.enableJSSE=true'

    • If you start your domains with the scripts startWebLogic.sh, startManagedServer.sh, or startNodeManager.sh:

      JAVA_OPTIONS='-Dweblogic.ssl.JSSEEnabled=true -Dweblogic.security.SSL.enableJSSE=true'
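The console steps for enabling JSSE can also be scripted. Here is a hedged WLST online sketch — the admin URL, credentials, and the server name AdminServer are placeholder assumptions you must adapt to your own domain:

```python
# WLST (Jython) sketch -- run with WL_HOME/common/bin/wlst.sh
# Placeholder credentials, URL, and server name; adjust for your domain.
connect('weblogic', 'welcome1', 't3://localhost:7001')
edit()
startEdit()
# Navigate to the SSL configuration of the target server.
cd('/Servers/AdminServer/SSL/AdminServer')
cmo.setJSSEEnabled(true)   # same effect as ticking "Enable JSSE" in the console
save()
activate()
disconnect()
```

As with the console change, a full restart of the domain (including the NodeManager) is still required afterwards.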

Your Java and WebLogic environment is now ready to support newer SSL certificates!

Enjoy!

References

  • OID Repository (online lookup of Object Identifiers)
  • Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files, on the Oracle Technology Network
