Was the Wild Frontiersman a Prolific Penman? A Stylometric Investigation into the Works of Davy Crockett

The American icon and folk hero David (Davy) Crockett (1785–1836) is known as the legendary bear-fighting “King of the Wild Frontier,” but it is not as widely known that Crockett was also a member of the United States Congress in the 1830s—and an author.

Crockett was first elected as a member from Tennessee in 1827 as a supporter of President Andrew Jackson, his former general in the Indian Wars, but soon after, his support for Jackson began to waver. He opposed Jackson’s plans both to forcibly relocate Indian tribes and to destroy the Second Bank of the United States, while Jackson opposed Crockett’s land bill supporting squatters.

Crockett’s sympathies and politics were aligning with those of the Northeastern Whigs, who believed that Crockett’s popularity would hurt Jackson at the polling booths because he could attract a voter population the Whigs had neglected: the frontiersmen. Not only did the bank provide Crockett money to aid his persistent battles against debt, but the Whig party also promised to support the bills that Crockett proposed. They began to write Crockett’s speeches and circular letters (political letters that were written to reach a larger audience), and to transform Crockett from the uneducated, constantly debt-ridden backwoodsman, into Crockett the frontier statesman.

The Whigs also played a big part in the books published under Crockett’s name, which were used as tools to promote the party’s manifesto. A play indirectly written about Crockett in 1831, titled The Lion of the West, or A Trip to Washington, led to nationwide acknowledgment of Crockett’s name as a writer. Historian James Shackford (1956) claimed that the Whig party spread this narrative with the intention of making “David a more powerful anti-Jackson weapon in the hands of the friends of the U.S. Bank.”

Publications in Question

The Life and Adventures of Colonel David Crockett of West Tennessee is an 1833 biography of Crockett, written to help support his re-election to Congress. The book was copyrighted in the name of James French, an author who was also Crockett’s friend. Shackford attributes Life to Matthew St. Clair Clarke, the clerk of the House of Representatives (1822–33), who was a friend of the president of the Second Bank of the United States. French also wrote another book, Elkswatawa, which includes a character who markedly resembles Crockett.

A Narrative of the Life of David Crockett of the State of Tennessee is an 1834 publication under Crockett’s name that was intended to correct the wrongs of the previous biography. Michael Wallis, in his best-selling 2011 book David Crockett: The Lion of the West, asserts that “much credit for the Crockett autobiography was due to Thomas Chilton, who had an ear for his friend’s vast repertoire of stories laced with idioms and phrases.” Chilton had helped Crockett write legislative documents and speeches, and seems the obvious choice as a collaborator. Crockett dictated his thoughts and memories to Chilton as Narrative took shape. Published in February 1834, Narrative sold at least 10,000 copies that year, which Wallis estimates earned Crockett about $2,000 and helped him pay off some of his enormous debts.

In 1834, while Congress was still in session, Crockett was invited by a group of New England manufacturers to go on an extended tour of the Northeast. He plunged into a three-week book promotional tour of several major eastern cities, in the process missing important votes, floor debates, and other congressional business. When he returned, Crockett schemed with one of his boarding house friends, Pennsylvania Representative William Clark, to get his finances in order by writing yet another book.

Wallis states, “The agreement with Clark was for Crockett to provide him with a collection of newspaper accounts, speeches given on the tour, and any other odd notes and documents that could be organized and cobbled together to form a book.” The result, titled An Account of Col. Crockett’s Tour to the North and Down East in the Year of Our Lord One Thousand Eight Hundred and Thirty Four, contains Crockett’s praise for the “idyllic” conditions he supposedly witnessed for young women working in the factories.

Shackford indicates that Clark was the true author behind this work. Crockett wrote to his publishers that he had “taken 31 pages [of the work] to Mr. Clark to correct and have twelve more ready for him” in a December 21st letter. He further indicates that he “[took] to Mr. Clark 55 pages of [his] new book” in a December 24th letter. Although the book must have been heavily edited due to Crockett’s poor command of written English, the letters do indicate that he had a large part in writing it and in collecting materials used in it.

Even before this book was released, Crockett came up with another idea for his publishers: writing a satirical biography of Martin Van Buren, the probable successor to Andrew Jackson and a man whom Crockett hated. This book was released in June 1835 under the title The Life of Martin Van Buren, Hair(sic)-Apparent to the “Government” and the Appointed Successor of General Jackson. Wallis asserts that “any contribution Crockett made was minimal at best, since it was once again ghostwritten, penned this time by Augustin Smith Clayton, a jurist who represented Georgia in Congress from 1832 to 1835.”

Crockett says in a letter of April 16, 1835, that “I have been looking for a letter from Judge Clayton…I am anxious to hear how he is coming on with the life of Van….” As a fellow Whig, Clayton was also vehemently opposed to Jacksonian policies, and especially to Jackson’s hand-picked successor, Van Buren.

Defeated in the August 1835 House elections, Crockett looked for solace elsewhere and decided to set off to Texas to aid the Texans in their revolt against Mexican rule. He was, famously, one of the defenders at the Alamo, where he either died in the fighting or, according to some sources, was captured and executed after the surrender (in 1975, the diary of a Mexican army officer serving under Santa Anna at the Alamo was translated into English; it claimed that Crockett was one of seven survivors captured and executed by order of the general). Colonel Crockett’s Exploits and Adventures in Texas (1836) is purportedly based on Crockett’s experience during the Alamo, where he met this untimely end.

Given Crockett’s consistently poor writing ability, demonstrated through his letters, and the naturally intense wartime environment, it is unlikely that he was the author of this book. Instead, multiple accounts indicate that author and playwright Richard Penn Smith was behind Texas. Among the sources that corroborate this claim are Edgar Allan Poe, who based his claim on what a personal friend of Smith’s had indicated, and the accounts of Edward Carey, one of the publishers of the book. John Seelye, the editor of On to the Alamo: Colonel Crockett’s Exploits and Adventures in Texas, says that “[b]y 1834 the author of Exploits and Adventures had been identified as Richard Penn Smith, a Philadelphia author hired by the publishers Carey and Hart in 1836 to produce a book attributed to Crockett that would exploit his recent martyrdom.”

Shackford claims that the first two chapters of this work are based on two letters Crockett wrote to the publishers, Carey and Hart. The “diary entry” style of these two chapters is not used again until the last chapter of the work. Finally, a section written after Crockett’s part of the work ends includes passages taken practically verbatim from a June 9, 1836, letter written from Galveston Bay by a correspondent for the New York Courier and Enquirer. It is possible that the middle part of this work was authored by Smith.

This work deserves extra scrutiny due to its importance in history. According to historian Bill Groneman (1996), author of Eyewitness to the Alamo, this work “has served as an eyewitness account of the Alamo in the past” and has been used as fact for what actually transpired during that battle, but it is mostly fictitious, and many of its accounts are inconsistent and contradict each other.

There is, therefore, significant evidence that Crockett did not write the four books published under his name. His ability to write was limited by his lack of a formal education; he said in his autobiography that he had attended school for only three days in his life and that he was illiterate until his wife taught him how to read and write. A letter Crockett wrote on March 11, 1828, demonstrates his inability to write coherently and grammatically:

“You will excuse me for not writing to you earlyer I did wish to have somthing worth your attention tho it is in vain to wait any longer we are ingaged…”

Comparing the Works

In a paper in CHANCE (1999), David and Dena Salsburg used simple stylometric techniques to compare the Narrative, Tour, and Texas books to a sample of 1,972 words from Crockett’s congressional speeches as recorded in Gales and Seaton’s Register, with works by Nathaniel Hawthorne and James Fenimore Cooper as controls. Employing chi-square techniques on a sample of nine frequently occurring non-contextual function words, they concluded that only the publication Texas did not match his speeches. However, the Salsburgs did not investigate any of the possible ghostwriters in their analysis.
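As a rough illustration of the kind of chi-square comparison described here, the sketch below computes a Pearson chi-square statistic for a two-row table of function-word counts. It is not a reconstruction of the Salsburgs’ procedure: the function name and the counts are hypothetical, and the example uses three words rather than nine to keep the arithmetic easy to follow.

```python
def chi_square_homogeneity(counts_a, counts_b):
    """Pearson chi-square statistic for a 2 x k table of word counts.

    counts_a and counts_b hold the counts of the same k words in two texts.
    """
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand = total_a + total_b
    stat = 0.0
    for a, b in zip(counts_a, counts_b):
        column_total = a + b
        for observed, row_total in ((a, total_a), (b, total_b)):
            # Expected count if both texts used the word at the same rate
            expected = row_total * column_total / grand
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts of three function words in two texts
same_style = chi_square_homogeneity([50, 30, 20], [100, 60, 40])
different_style = chi_square_homogeneity([50, 30, 20], [20, 30, 50])
print(same_style, different_style)
```

The statistic is zero when the two texts use the words in identical proportions and grows as the proportions diverge; it would be compared against a chi-square distribution with k - 1 degrees of freedom, where k is the number of words.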

Figure 1. Summary of Crockett works and associated ghostwriters.

Stylometric techniques and accessibility to larger corpora have evolved considerably in the succeeding 20 years, justifying a new investigation into the true authorship of Crockett’s works. In addition to the three books the Salsburgs analyzed, we reviewed two more books, Van Buren and Life. This research does not employ speeches listed in the Register of Debates, a congressional record of debates that transpired in Congress, since at that time, the Register was not a verbatim account. Instead, it often paraphrased, rendering it inauthentic to Crockett. This research aims to analyze the works of possible ghostwriters of Crockett’s publications, which will aid in identifying the true authorship of the works.

Publications of Ghostwriters and Controls Included in Analysis

The works of ghostwriters and controls included in this analysis are detailed in Table 1. The first four authors are the controls used: authors whose books are authentic and who can be used to make comparisons with works that have unclear authorship. The use of control authors helps to validate the methods employed and acts as an anchor for the study.

Table 1—All Works Included and Word Count of Sample Splits

After the controls, all the potential ghostwriters and their works are listed, with the year published and a letter that is attributed to each work to identify the book in the analysis diagrams. Each diagram includes the author’s name and the letter to indicate which book is used. If only one publication by an author is included, only that author’s name and the section of the work are shown.


Using stylometry, the statistical analysis of literary style, this study attempts to uncover the true authorship of the works purportedly written by David Crockett. The study uses the “Burrows’ method,” a robust and proven stylometric technique. This method analyzes the N most-common function words in the corpus, computes the rate of occurrence of each of these N words in all the textual samples, and then creates a frequency table that can be analyzed using multivariate statistical techniques.
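The frequency table at the heart of this approach can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Stylo or Vocsoft implementation: the tokenizer is deliberately crude, the function names and sample texts are hypothetical, and the sketch ranks all words by frequency rather than restricting the list to function words.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_table(samples, n_words):
    """Return (top_words, table); table[name][word] is the rate per 100 tokens."""
    tokens = {name: tokenize(text) for name, text in samples.items()}
    corpus_counts = Counter()
    for toks in tokens.values():
        corpus_counts.update(toks)
    # The N most-common words across the whole corpus
    top_words = [word for word, _ in corpus_counts.most_common(n_words)]
    table = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        total = len(toks) or 1  # guard against an empty sample
        table[name] = {w: 100.0 * counts[w] / total for w in top_words}
    return top_words, table


# Hypothetical toy samples; real samples would be thousands of words long
samples = {
    "sample_a": "the bear and the dog ran to the river and the man ran to it",
    "sample_b": "it was the time of the hunt and it was a time of great noise",
}
top_words, table = frequency_table(samples, 5)
print(top_words)
print(table["sample_a"])
```

Each row of the resulting table describes one text sample by the rates at which it uses the most-common words, which is the input that the multivariate techniques operate on.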

Authors have been shown to use non-contextual words at characteristic frequencies. These include the more-humble servants of speech such as “as, of, the, at” and are generally short words that, being devoid of context, writers use almost subconsciously. To compare samples of text and strengthen the claims made, the study uses methods of cluster analysis and machine learning, with 60 to 100 of the most-common function words for cluster analysis and up to 270 for the iterative machine learning method.

To account for genre and topic-specific works, the data set has been culled when necessary; this ensures that a predetermined percentage of the corpora contains all of the most-frequent words collected and avoids bias by removing words that are heavily used in certain genres.
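One way to picture culling is as a filter on the word list: a word is kept only if it occurs in at least a chosen share of the text samples. The sketch below assumes that reading; the function name, the threshold, and the toy rates are hypothetical and are not taken from the study.

```python
def cull(table, words, min_share):
    """Keep words that occur in at least min_share (0 to 1) of the samples.

    table[name][word] is the word's rate in that sample; a rate of 0 means
    the word is absent from the sample.
    """
    n_samples = len(table)
    kept = []
    for word in words:
        present = sum(1 for rates in table.values() if rates.get(word, 0.0) > 0.0)
        if present / n_samples >= min_share:
            kept.append(word)
    return kept


# Hypothetical rates per 100 words in four samples
table = {
    "s1": {"the": 6.0, "upon": 0.4, "bear": 0.9},
    "s2": {"the": 5.8, "upon": 0.3, "bear": 0.0},
    "s3": {"the": 6.1, "upon": 0.0, "bear": 0.0},
    "s4": {"the": 5.9, "upon": 0.5, "bear": 0.0},
}
words = ["the", "upon", "bear"]
kept_75 = cull(table, words, 0.75)   # words found in at least 3 of 4 samples
kept_100 = cull(table, words, 1.0)   # words found in every sample
print(kept_75, kept_100)
```

A topic-bound word such as "bear" drops out at the 75 percent threshold because it appears in only one sample, while a word used across nearly all samples survives.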

Preparing the corpus involved taking 4,000-word chunks of text from the beginning, middle, and end of each publication, which ensures sampling a significant amount of text from each part of the book. The programming language R (specifically the package “Stylo”) and the stylometric program Vocsoft (written by Dr. Richard Forsyth) were used to create a table of the frequencies of the most-common function words in the texts. Then, R and JMP were employed to analyze these frequencies using multivariate statistical techniques.
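The sampling step might look like the following sketch. The exact positions of the chunks are an assumption (the first tokens, a centered middle window, and the last tokens), and the function name is hypothetical; the example uses a chunk size of 4 on a 20-token list so the result is easy to check by eye.

```python
def sample_chunks(tokens, chunk_size=4000):
    """Return beginning, middle, and end chunks of chunk_size tokens each."""
    n = len(tokens)
    if n < 3 * chunk_size:
        raise ValueError("text too short for three non-overlapping chunks")
    # Center the middle chunk within the text
    mid_start = (n - chunk_size) // 2
    return {
        "beginning": tokens[:chunk_size],
        "middle": tokens[mid_start:mid_start + chunk_size],
        "end": tokens[n - chunk_size:],
    }


# Toy example: 20 stand-in tokens and a chunk size of 4
chunks = sample_chunks(list(range(20)), chunk_size=4)
print(chunks)
```

Requiring at least three chunk lengths of text guarantees that the three samples never overlap.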

In an initial exploratory statistical analysis, different distance measures were computed between the text samples, based on their function-word frequencies, most notably standardized Euclidean, standardized Manhattan, and the Classic Delta distances. These were then paired with Ward’s method to create a dendrogram that displays the different works as clusters.
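To make the distance and clustering step concrete, here is a small standard-library sketch. It standardizes each word’s rates to z-scores, computes Classic Delta (the mean absolute difference of z-scores), and then merges samples with the Lance-Williams form of Ward’s update applied directly to those distances. The sample names and rates are hypothetical, and this is an illustration under those assumptions rather than the routine used in R or JMP.

```python
import math


def zscore_columns(rows):
    """Standardize each column (word) to mean 0 and standard deviation 1."""
    n = len(rows)
    cols = []
    for col in zip(*rows):
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        cols.append([(v - mean) / sd if sd > 0 else 0.0 for v in col])
    return [list(r) for r in zip(*cols)]


def classic_delta(a, b):
    """Classic Delta: mean absolute difference between two z-score vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def standardized_euclidean(a, b):
    """Euclidean distance between two z-score vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ward_merges(names, dist):
    """Merge clusters using the Lance-Williams form of Ward's update.

    dist maps frozenset({a, b}) to a distance. Returns the merges in order.
    """
    sizes = {name: 1 for name in names}
    d = dict(dist)
    merges = []
    while len(sizes) > 1:
        pair = min(d, key=lambda p: d[p])      # closest pair of clusters
        i, j = sorted(pair)
        new = i + "+" + j
        ni, nj, dij = sizes[i], sizes[j], d[pair]
        for k, nk in sizes.items():
            if k in (i, j):
                continue
            dki, dkj = d[frozenset((k, i))], d[frozenset((k, j))]
            d[frozenset((k, new))] = (
                (ni + nk) * dki + (nj + nk) * dkj - nk * dij
            ) / (ni + nj + nk)
        # Drop distances that involve the two clusters just merged
        d = {p: v for p, v in d.items() if i not in p and j not in p}
        del sizes[i], sizes[j]
        sizes[new] = ni + nj
        merges.append((i, j))
    return merges


# Hypothetical rates (per 100 words) of three function words in four samples
rates = {
    "A1": [6.0, 3.0, 2.0],
    "A2": [6.2, 3.1, 2.1],
    "B1": [4.0, 4.0, 3.0],
    "B2": [4.1, 4.2, 2.9],
}
names = sorted(rates)
z = dict(zip(names, zscore_columns([rates[n] for n in names])))
dist = {
    frozenset((a, b)): classic_delta(z[a], z[b])
    for idx, a in enumerate(names)
    for b in names[idx + 1:]
}
merges = ward_merges(names, dist)
print(merges)
```

The order of the merges is what a dendrogram draws: samples joined early sit on neighboring branches, and the final merge forms the root of the tree.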

A relatively new method, known as the “Bootstrap Consensus Tree” and introduced to stylometry by Maciej Eder of the Polish Academy of Sciences, was used to analyze the different clusters that appear in the publications based on the most-common non-contextual words. The logic behind the Bootstrap Consensus Tree is that with analysis of only a few inputs of the most-frequent words, accidental similarities may appear; however, with more inputs throughout multiple snapshots of the most-frequent words (iterations of 100, 200, 300…MFWs), more-authentic patterns and groupings will reappear.

This method runs a minimum of four iterations of dendrograms, which are then weighed against each other and combined into a tree with the strongest consensus. Thus, Bootstrap Consensus Trees follow the same logic as clustering using dendrograms, but employ an iterative method that captures a much-stronger and broader picture using patterns across many iterations of the most-frequent words in a textual corpus.
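The consensus idea can be illustrated with a deliberately simplified stand-in. Instead of combining full dendrograms, the sketch below records each sample’s nearest neighbor under several most-frequent-word settings and keeps only the links that recur in a chosen share of the runs. This is not Eder’s algorithm or the Stylo implementation; the function names, the consensus threshold of 0.5, and the toy rates are all assumptions made for illustration.

```python
import math
from collections import Counter


def zscores(rows):
    """Standardize each column (word) to mean 0 and standard deviation 1."""
    n = len(rows)
    cols = []
    for col in zip(*rows):
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        cols.append([(v - mean) / sd if sd > 0 else 0.0 for v in col])
    return [list(r) for r in zip(*cols)]


def delta(a, b):
    """Classic Delta: mean absolute difference between two z-score vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def nearest_neighbour_links(table, n_words):
    """Link each sample to its closest sample using the first n_words columns."""
    names = sorted(table)
    vec = dict(zip(names, zscores([table[n][:n_words] for n in names])))
    links = set()
    for a in names:
        best = min((b for b in names if b != a),
                   key=lambda b: delta(vec[a], vec[b]))
        links.add(frozenset((a, best)))
    return links


def consensus_links(table, word_settings, strength=0.5):
    """Keep links that appear in at least `strength` of the runs."""
    tally = Counter()
    for n_words in word_settings:
        tally.update(nearest_neighbour_links(table, n_words))
    return {link for link, count in tally.items()
            if count / len(word_settings) >= strength}


# Hypothetical rates; columns are ordered from most to least frequent word
table = {
    "A1": [6.0, 3.0, 2.0, 1.5, 1.0, 0.9],
    "A2": [6.1, 3.1, 2.1, 1.4, 1.1, 0.8],
    "B1": [4.0, 4.0, 3.0, 0.8, 1.6, 0.4],
    "B2": [4.1, 4.1, 2.9, 0.9, 1.5, 0.5],
}
links = consensus_links(table, word_settings=[2, 4, 6])
print(links)
```

A link that shows up under only one word setting is treated as a possible accident and discarded, while a link that persists across settings is retained, which mirrors the reasoning behind the consensus tree.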

Hierarchy of Cluster Analyses

Initially, the controls were analyzed separately, using the Bootstrap Consensus Tree to test the strength of the method and its application to controls. A bootstrap consensus analysis of the Crockett publications with two controls, titled “within analysis,” tests the authorial signal of the questioned publications themselves and whether more than one hand probably wrote the books.

To conclude the statistical analysis, a dendrogram of the ghostwriters against the Crockett publications looks at whether there is evidence that the ghostwriters identified previously wrote Crockett’s publications.

The bootstrap consensus analysis of the controls in Figure 2 shows that between the four iterations of 60 to 100 function words, the strongest combination of dendrograms places each book in the correct authorial cluster. We see three clusters (Hawthorne, Cooper, and Melville), and there is even the case of different books by the same author clustering together more closely than parts of the same book, as seen with both Melville’s Pioneers 1 and Moby Dick 2, and Cooper’s Last of the Mohicans 1 and Deerslayer 1. The data set was not culled because culling made no significant impact on the clustering.

Figure 2. Bootstrap Consensus Trees showing analysis of controls.

The bootstrap consensus analysis in Figure 3, which includes only Crockett’s purported publications and two controls (Cooper and Melville), supports the claim that more than one hand wrote Crockett’s works. Interestingly, we see a split with Narrative, Tour, and Texas 1 through 2.2 on one side of the tree, while the rest of the publications are on a separate side. This is important because Crockett surely contributed to Narrative and Tour, and Texas 1 was simply his edited letters. The works Van Buren and Life appear to differ from this cluster. Culling was not used because it made no significant impact on the analysis.

Figure 3. Within analysis.

The 70 non-contextual words used in the cluster analysis for Figure 4 are: the, and, of, to, a, in, was, that, it, as, for, with, had, on, by, which, but, at, were, is, this, be, not, from, all, have, so, an, when, or, would, been, who, one, there, no, upon, if, its, some, time, out, than, man, more, could, up, now, then, about, will, little, has, what, into, before, such, made, much, any, every, most, great, where, first, over, many, should, only, very, like.

Figure 4. Dendrogram Analysis—within analysis using 70 non-contextual words.

This cluster analysis, and its associated dendrogram shown in Figure 4, gives a strong vindication of the historians’ accounts. It affirms that J.S. French was indeed the likely author of Life, and Augustin Clayton was the likely author of Van Buren.

Further, it shows evidence that while Richard Penn Smith wrote the middle parts of Texas (Texas 2.1 and Texas 2.2), the very beginning and end (Texas 1 and Texas 3) have unclear authorship.

Since Texas 1 contains a nearly 4,000-word chunk of the first two chapters of the book, which were heavily edited letters Crockett wrote to his publishers, it is noteworthy that Tour clusters with it, implying that either the author of Tour contributed to this section, or there are remnants of Crockett’s style from Tour. Furthermore, Texas 3 clusters with the control, which is understandable as parts of it were taken verbatim from letters not written by any of the ghostwriters or Crockett. The data set was culled to ensure that there was no bias in favor of certain words that were text-related.

Machine Learning Analysis

A machine learning method known as Support Vector Machines (SVMs) was applied to this problem to validate the evidence indicated by the dendrograms. SVMs work particularly well with high-dimensional data, given their ability to process many thousands of unique inputs, which makes them a strong tool for stylometry. They may be used to classify and predict patterns from a set of labeled data and often perform as well as or better than other widely used classifiers.

The general logic behind SVMs is that they classify data based on the production of a discriminant function that minimizes the training error while maximizing the margin that separates the data classes. SVMs find the best hyperplane that separates a set of “positive” examples from a set of “negative” examples, with the margin capturing the maximum interclass difference. The optimal hyperplane is found for linearly separable patterns, and the approach can be extended to non-linear patterns by transforming the original data into a new space by means of a kernel function. An established methodology to choose the best kernel with the most-effective parameters for specific applications of SVMs does not currently exist, but this study used the linear kernel, since the number of variables (most-frequent words) far surpasses the number of samples (texts).

The Stylo package was used in conducting the SVM analysis of the David Crockett problem. Known control and ghostwriter texts were fed to the model as a “training set,” while the publications in question, which need attribution, were placed in a “secondary set.” To test the robustness of the attribution method and see whether established works map to themselves, the secondary set also included publications not in question.
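The training-set and secondary-set workflow can be sketched with a very small linear SVM. The example below trains a two-class model by subgradient descent on the hinge loss, which stands in for the solver a real package would use; it is not the Stylo routine. The author labels, the feature values (imagined as standardized word rates), and the settings for the regularization weight, learning rate, and number of epochs are all hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit w and b by subgradient descent on the regularized hinge loss.

    X is a list of feature vectors; y holds labels of +1 or -1.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:
                # Inside the margin: push the hyperplane away from this point
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Outside the margin: apply only the regularization shrinkage
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w, b, x, positive, negative):
    """Return the label on whichever side of the hyperplane x falls."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return positive if score >= 0 else negative


# Training set: hypothetical standardized rates of two function words
train_X = [[1.0, -0.8], [0.9, -1.1], [-1.0, 0.9], [-1.1, 1.0]]
train_y = [1, 1, -1, -1]   # +1 = "author_x", -1 = "author_y" (made-up labels)

# Secondary set: texts whose attribution is to be predicted
secondary_X = [[0.8, -0.9], [-0.9, 1.2]]

w, b = train_linear_svm(train_X, train_y)
predictions = [predict(w, b, x, "author_x", "author_y") for x in secondary_X]
print(predictions)
```

With more than two candidate authors, a common extension is to train one such model per author against all the others and attribute each text to the author whose model gives the highest score.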

The output of the support vector machine analysis is outlined in Figure 5. Each iteration shows which author from the training set was attributed to each text in the secondary set. A classification percentage is also reported under each iteration; it measures how often the known works in the secondary set were attributed to their true authors from the training set. The model was strong enough that there were no misattributions, giving an attributive success of 100 percent for these known works in all three iterations. These data were also culled, since a number of words are based heavily on genre and would have biased the analysis.

Figure 5. Machine learning analysis of ghostwriters vs. Crockett publications.

The data from the machine learning analysis show exactly what the cluster analysis dendrogram in Figure 4 demonstrated: In all three iterations, the technique attributed Life to J.S. French and Van Buren to Augustin Clayton. The results for Texas are also in accordance with Figure 4. Richard Penn Smith, in all iterations except the third, was attributed as writing the middle part of Texas (Texas 2.1 and Texas 2.2). Texas 1 was attributed to the author of Tour, while Texas 3 was arbitrarily attributed to James Fenimore Cooper, who most certainly did not write the book.

Discussion and Conclusion

This study assessed the claims made by historians concerning the authenticity of David Crockett’s publications by using statistical analysis of non-contextual function words. There is no single author of the Crockett works. Using “traditional” methods of authorship attribution, the study identified four possible authors as having written Crockett’s books, outlined in Figure 1, and two authors as having written his biography, Life. Although only three of the six authors had enough published material in book form to be included in the stylometric, “non-traditional,” analysis, this was sufficient because it enabled analysis and attribution of the three most-important works: Life, Van Buren, and Texas. Crockett’s Tour and Narrative were also important publications, but through his own letters, Crockett reveals the authors of these books, while indicating that he also personally contributed to them.

Stylometry has served as a tool to help uncover the likely authors of these works. Using cluster analysis and machine learning classifiers, this research showed that Van Buren was written by Augustin Clayton, Life was written by French, and Texas had mixed authorial signals and was more complicated than initially expected.

Historian James Shackford demonstrated that the first two chapters of Texas are edited versions of letters sent from Crockett to his publishers, while the very last chapter was written by an unknown author, with parts quoted verbatim from a letter written from Galveston Bay by a newspaper correspondent. The first part of the text was stylometrically attributed to the author of Tour, which is noteworthy because Crockett indicates in his letters that he was sending papers to the author, “Mr. Clark,” which Clark used to create the book Tour. This might show that Crockett did indeed have a hand in writing the book.

The very last part of the book matched the control James Fenimore Cooper, who was certainly not the author but was attributed as such because Texas 3 is so unlike the rest of the publications and closest to a control.

Finally, Richard Penn Smith was attributed as writing the middle chunk of the book.

This study is the first to use stylometry and the more-advanced methods that the field has produced to attribute authorship to Life, Van Buren, and Texas. Although the authors of Narrative and Tour have been exposed through Crockett’s letters, further confirmatory stylometric analyses should be run on these texts with the development of more-robust cross-genre techniques, since only published letters are available from the ghostwriters of these texts.

Further Reading

Support Vector Machines and their application:

Ayat, N.E., Cheriet, M., and Suen, C.Y. 2005. Automatic Model Selection for the Optimization of SVM Kernels. Pattern Recognition 38: 1733–1745.

Diederich, J., Kindermann, J., Leopold, E., and Paass, G. 2003. Authorship Attribution with Support Vector Machines. Applied Intelligence 19(1–2): 109–123.

Ebrahimpour, M., Putnins, T., Berryman, M., Allison, A., Ng, B., and Abbot, D. 2013. Automated Authorship Attribution Using Advanced Signal Classification Techniques. San Francisco, CA: PLOS.

Bootstrap Consensus Trees:

Eder, M. 2017. Visualization in stylometry: cluster analysis using networks. Digital Scholarship in the Humanities 32(1): 50–64.


Crockett, David. 2018. The Life of Martin Van Buren. Kindle ed. HardPress.

Crockett, David. 2018. The Life and Adventures of Colonel David Crockett of Tennessee. Kindle ed. HardPress.

Burrows, J.F. 1992. Not unless you ask nicely: the interpretive nexus between analysis and information. Literary and Linguistic Computing 7: 91–109.

Eder, M., Rybicki, J., and Kestemont, M. 2016. Stylometry with R: a package for computational text analysis. R Journal 8(1): 107–121.

French, James S. 2018. Elkswatawa, Or, The Prophet of the West, A Tale of the Frontier. Kindle ed. HardPress.

Groneman, Bill. 1996. Eyewitness to the Alamo. Republic of Texas Press.

Hawthorne, N., and Myerson, J. 2002. Selected letters of Nathaniel Hawthorne. Columbus, OH: Ohio State University Press.

Holmes, D.I., and Kardos, J. 2003. Who was the author? An introduction to Stylometry. CHANCE 16(2).

Paulding, J. K. 1954. The lion of the West: Retitled The Kentuckian: Or, A trip to New York: A farce in two acts. Stanford, CA: Stanford University Press.

Salsburg, David. 2017. Errors, Blunders, and Lies. How to tell the Difference. Boca Raton, FL: Chapman and Hall/CRC.

Salsburg, David, and Salsburg, Dena. 1999. Searching for the “Real” Davy Crockett. CHANCE.

Shackford, James Atkins. 1956. David Crockett: The Man and the Legend. John B. Shackford, editor. Chapel Hill: University of North Carolina Press.

Smith, Richard Penn. 2003. On to the Alamo: Colonel Crockett’s Exploits and Adventures in Texas. John Seelye, editor. London, UK: Penguin Group.

Smith, Richard Penn. 2018. The Miscellaneous Works of the Late Richard Penn Smith. Kindle ed. HardPress.

Wallis, M. 2011. David Crockett: The Lion of the West. New York, NY: W.W. Norton and Co.

About the Authors

David Holmes is a professor emeritus in statistics at the College of New Jersey who now teaches in the Department of Statistics at George Mason University. He has conducted research in stylometry for more than 30 years, publishing and presenting extensively both in the U.S. and the UK. He has a PhD in statistics from King’s College, University of London.

Ferris Samara is a recent graduate of George Mason University and currently works as a data analyst at Freddie Mac. He graduated with degrees in economics and data analysis, and worked on this stylometric investigation through an undergraduate research grant. His main research interests are in stylometry, machine learning, and better understanding of how to use Big Data to tackle modern statistical problems.
