Recently I used ManifoldCF to index 50+ file shares, databases, and websites. For those not familiar with the tool, it handles the thankless job of getting content into search engines. It has a web crawler and about a dozen other options out of the box to get your content into systems like Elasticsearch.
Because the number of jobs kept growing, it became imperative to script the export and import of jobs as we migrated environments. As with most open source projects, the contributors spend far more of their time on bug fixes and features than on documentation.
Using the programmatic endpoints to automate ManifoldCF
ManifoldCF’s administrative functions are also accessible through a programmatic interface that responds in ‘RESTful’ ways. While not truly RESTful, it gets the job done. That job typically involves exporting or importing an output connector, a repository, or a job.
The basic format of the JSON servlet resource URLs is as follows:
http[s]://<server_and_port>/mcf-api-service/json/<resource>
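To make the URL pattern concrete, here is a minimal sketch of a helper that builds these URLs. The function name `resource_url` and its parameters are my own, not part of ManifoldCF, and I am assuming plain percent-encoding is enough for simple names like `coveo2`; names with unusual characters may need different handling.

```python
from urllib.parse import quote


def resource_url(server_and_port, resource, name=None, https=False):
    """Build a JSON API URL for a resource, optionally for one named item.

    Hypothetical helper: only the /mcf-api-service/json/<resource> layout
    comes from the pattern above.
    """
    scheme = "https" if https else "http"
    url = f"{scheme}://{server_and_port}/mcf-api-service/json/{resource}"
    if name is not None:
        # Assumption: percent-encoding is sufficient for simple names.
        url += "/" + quote(name, safe="")
    return url


print(resource_url("localhost:8345", "outputconnections", "coveo2"))
```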
A full list of endpoints is available on the ManifoldCF website. For the task of migrating the general configuration, we’ll focus on the following:
- jobs (get/post)
- repositoryconnections (get/put)
- outputconnections (get/put)
The API is somewhat inconsistent about where it expects a POST versus a PUT, but that is forgivable.
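One way to keep the POST versus PUT difference out of the rest of a script is a small lookup table. This is only a sketch based on the three resources listed above; `CREATE_VERB` and `create_call` are names I made up for illustration.

```python
# Verb used to create each resource, following the list above:
# jobs are created with POST, the two connection types with PUT.
CREATE_VERB = {
    "jobs": "POST",
    "repositoryconnections": "PUT",
    "outputconnections": "PUT",
}


def create_call(resource, name=None):
    """Return (verb, relative_path) for creating an item of this resource."""
    verb = CREATE_VERB[resource]
    if verb == "PUT":
        # PUT targets a named item, e.g. outputconnections/coveo2.
        if name is None:
            raise ValueError(f"{resource} needs a name for PUT")
        return verb, f"{resource}/{name}"
    # POST goes to the collection itself.
    return verb, resource


print(create_call("outputconnections", "coveo2"))
print(create_call("jobs"))
```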
Getting/Setting an output connector
The most notable hurdle is the exact form of what needs to be sent. If you look at the connector's source code, you can deduce its field names and so determine which attributes go into the ‘configuration’ element. Given that, the following is what needs to be sent to the outputconnections endpoint to create a Coveo output:
PUT http://localhost:8345/mcf-api-service/json/outputconnections/coveo2
{
  "outputconnection": {
    "max_connections": "11",
    "configuration": {
      "apibaseurl": "https://push.cloud.coveo.com/v1",
      "organizationid": "michaelcizmar",
      "sourceid": "michaelcizmar-wg6zdoxn3tm6yiujdjvrerko6a",
      "apikey": "xxaasdfss4bfa2-2b2a-4170-b707-0e6ba4sfg0351"
    },
    "description": "Coveo",
    "class_name": "org.apache.manifoldcf.agents.output.coveo.CoveoConnector"
  }
}
You can GET the same endpoint with the name to fetch that connector's properties, or without the name to list all of the output connectors.
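As a rough sketch of scripting that PUT with only the Python standard library, the snippet below builds the request without sending it. The helper name `build_put`, the placeholder configuration values, and the `Content-Type` header are my assumptions; I have not confirmed which headers the servlet requires.

```python
import json
import urllib.request


def build_put(url, payload):
    """Build (but do not send) a PUT request carrying a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        # Assumption: the server accepts a JSON content type here.
        headers={"Content-Type": "application/json"},
    )


payload = {
    "outputconnection": {
        "max_connections": "11",
        "configuration": {
            "apibaseurl": "https://push.cloud.coveo.com/v1",
            "organizationid": "<organization-id>",  # placeholder
            "sourceid": "<source-id>",              # placeholder
            "apikey": "<api-key>",                  # placeholder
        },
        "description": "Coveo",
        "class_name": "org.apache.manifoldcf.agents.output.coveo.CoveoConnector",
    }
}

req = build_put(
    "http://localhost:8345/mcf-api-service/json/outputconnections/coveo2",
    payload,
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to log or inspect the payload before touching a live server.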
Getting / Setting a repository
Now that we have an output connection, we can make a repository connection.
GET / PUT http://localhost:8345/mcf-api-service/json/repositoryconnections
Something I noticed is that the repositoryconnection element is sometimes an array. I am testing this with 2.10 and will try to see whether that has changed in 2.13. Parsing something that is only ‘sometimes’ an array is always awkward, but it is likely not a huge issue because most people have multiple connections.
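A small normalizing helper takes the sting out of the ‘sometimes an array’ behavior. This is a sketch; `as_list` and `connection_names` are hypothetical names, and the sample responses are hand-written rather than captured from a server.

```python
def as_list(value):
    """Return value as a list whether it arrived as a list, a single item, or None."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def connection_names(response):
    """Collect the names of all repository connections in a parsed response."""
    connections = as_list(response.get("repositoryconnection"))
    return [conn.get("name") for conn in connections]


single = {"repositoryconnection": {"name": "web"}}
several = {"repositoryconnection": [{"name": "web"}, {"name": "files"}]}

print(connection_names(single))
print(connection_names(several))
```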
{
  "repositoryconnection": {
    "max_connections": "10",
    "configuration": {
      "trust": {
        "_attribute_trusteverything": "true",
        "_value_": "",
        "_attribute_urlregexp": ""
      },
      "bindesc": {
        "maxkbpersecond": {
          "_value_": "",
          "_attribute_value": "64"
        },
        "_attribute_caseinsensitive": "false",
        "maxconnections": {
          "_value_": "",
          "_attribute_value": "2"
        },
        "maxfetchesperminute": {
          "_value_": "",
          "_attribute_value": "12"
        },
        "_attribute_binregexp": "",
        "_value_": ""
      },
      "PARAMETER": [
        {
          "_value_": "[email protected]",
          "_attribute_name": "Email address"
        },
        {
          "_value_": "all",
          "_attribute_name": "Robots usage"
        },
        {
          "_value_": "all",
          "_attribute_name": "Meta robots tags usage"
        },
        {
          "_value_": "",
          "_attribute_name": "Proxy host"
        },
        {
          "_value_": "",
          "_attribute_name": "Proxy port"
        },
        {
          "_value_": "",
          "_attribute_name": "Proxy authentication domain"
        },
        {
          "_value_": "",
          "_attribute_name": "Proxy authentication user name"
        },
        {
          "_value_": "",
          "_attribute_name": "Proxy authentication password"
        }
      ]
    },
    "name": "web",
    "description": "",
    "isnew": "false",
    "class_name": "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector"
  }
}
Getting / Setting the Job
Lastly, with the output connection and the repository connection in place, we can spin up our jobs. Again, there are many different settings here depending on which connections you use. The following example is a simple web crawler.
Note that in this case, you create the job via POST.
POST http://localhost:8345/mcf-api-service/json/jobs
GET / PUT http://localhost:8345/mcf-api-service/json/jobs/job-name
{
  "job": {
    "expiration_interval": "infinite",
    "hopcount_mode": "accurate",
    "document_specification": {
      "limittoseeds": {
        "_value_": "",
        "_attribute_value": "true"
      },
      "excludes": "",
      "excludescontentindex": "",
      "seeds": "https://www.mcplusa.com/",
      "excludesindex": "",
      "includes": ".",
      "includesindex": "."
    },
    "description": "test",
    "priority": "5",
    "max_recrawl_interval": "infinite",
    "recrawl_interval": "86400000",
    "run_mode": "scan once",
    "reseed_interval": "3600000",
    "start_mode": "manual",
    "id": "1564281557088",
    "repository_connection": "web",
    "pipelinestage": [
      {
        "stage_isoutput": "false",
        "stage_id": "0",
        "stage_specification": {
          "keepAllMetadata": {
            "_value_": "",
            "_attribute_value": "true"
          },
          "writeLimit": {
            "_value_": "",
            "_attribute_value": ""
          },
          "ignoreException": {
            "_value_": "",
            "_attribute_value": "true"
          },
          "lowerNames": {
            "_value_": "",
            "_attribute_value": "false"
          }
        },
        "stage_connectionname": "tika"
      },
      {
        "stage_isoutput": "true",
        "stage_id": "1",
        "stage_specification": {},
        "stage_connectionname": "coveo2",
        "stage_prerequisite": "0"
      }
    ]
  }
}
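When moving a job between environments, one thing I would consider is stripping the exported "id" before POSTing it, on the assumption that the target server assigns its own id to a newly created job. The sketch below does only that; `prepare_job_for_import` is a hypothetical helper, and you should check what your ManifoldCF version expects before relying on it.

```python
import copy


def prepare_job_for_import(exported, drop_keys=("id",)):
    """Return a copy of an exported job document without environment-specific keys.

    Assumption: the target environment generates a fresh id when the job
    is created via POST, so the old one should not be sent along.
    """
    job = copy.deepcopy(exported["job"])
    for key in drop_keys:
        job.pop(key, None)
    return {"job": job}


exported = {
    "job": {
        "id": "1564281557088",
        "description": "test",
        "repository_connection": "web",
    }
}

ready = prepare_job_for_import(exported)
print(sorted(ready["job"].keys()))
```

Working on a deep copy leaves the exported document untouched, so the same export can be written to disk as a backup and reused for another environment.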
Stay tuned! I’ll be posting my finalized script to GitHub, which will help automate interaction with ManifoldCF.