module documentation

A similarities / code duplication command line tool and pylint checker.

The algorithm is based on comparing the hash value of n successive lines of a file. First the files are read and any lines that do not fulfill the requirements (comments, docstrings...) are removed. The remaining stripped lines are stored in the LineSet class, which gives access to them. Then each index of the stripped lines collection is associated with the hash of n successive entries of the stripped lines starting at that index (n is the minimum common lines option).

The hashes common to both linesets are then looked up. If there are matches, the match indices in both linesets are stored and associated with the corresponding couples (start line number/end line number) in both files. This association is then post-processed to handle successive matches. For example, if the minimum common lines setting is four, the hashes are computed over four lines. If one couple of match indices (12, 34) is the successor of another one (11, 33), then there are in fact five common lines. Once post-processed, the values of the association table are the result looked for, i.e. the start and end line numbers of the common lines in both files.
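The matching step described above can be illustrated with a minimal sketch. It works on plain lists of already stripped strings instead of LineSet objects, and the names `window_hashes` and `common_windows` are hypothetical helpers, not part of this module.

```python
from collections import defaultdict


def window_hashes(lines, n):
    """Map the hash of each run of n successive lines to its start indices."""
    table = defaultdict(list)
    for start in range(len(lines) - n + 1):
        table[hash(tuple(lines[start:start + n]))].append(start)
    return table


def common_windows(lines_a, lines_b, n=4):
    """Return sorted (index_a, index_b) pairs where n successive lines match."""
    hashes_a = window_hashes(lines_a, n)
    hashes_b = window_hashes(lines_b, n)
    pairs = []
    for key in hashes_a.keys() & hashes_b.keys():
        for i in hashes_a[key]:
            for j in hashes_b[key]:
                # Guard against hash collisions by comparing the actual lines.
                if lines_a[i:i + n] == lines_b[j:j + n]:
                    pairs.append((i, j))
    return sorted(pairs)


file_a = ["a = 1", "b = 2", "c = 3", "d = 4", "e = 5", "x = 0"]
file_b = ["y = 9", "a = 1", "b = 2", "c = 3", "d = 4", "e = 5"]
print(common_windows(file_a, file_b, n=4))  # [(0, 1), (1, 2)]
```

The two pairs are successive, (1, 2) following (0, 1), which is the situation the post-processing step handles: two overlapping windows of four lines amount to five common lines.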

Class Commonality Undocumented
Class CplSuccessiveLinesLimits Holds a SuccessiveLinesLimits object for each checked file and counts the number of common lines between both stripped lines collections extracted from both files.
Class LinesChunk The LinesChunk object computes and stores the hash of some consecutive stripped lines of a lineset.
Class LineSet Holds and indexes all the lines of a single source file.
Class LineSetStartCouple Indices in both linesets that mark the beginning of successive lines.
Class LineSpecifs Undocumented
Class Similar Finds copy-pasted lines of code in a project.
Class SimilarChecker Checks for similarities and duplicated code.
Class SuccessiveLinesLimits A class to handle the numbering of begin and end of successive lines.
Function filter_noncode_lines Return the effective number of common lines between lineset1 and lineset2 filtered from non code lines.
Function hash_lineset Return two dicts.
Function register Undocumented
Function remove_successive Removes all successive entries in the dictionary in argument.
Function report_similarities Make a layout with some stats about duplication.
Function Run Standalone command line access point.
Function stripped_lines Return tuples of line/line number/line type with leading/trailing white-space and any ignored code features removed.
Function usage Display command line usage information.
Constant DEFAULT_MIN_SIMILARITY_LINE Undocumented
Constant MSGS Undocumented
Constant REGEX_FOR_LINES_WITH_CONTENT Undocumented
Type Alias CplIndexToCplLines_T Undocumented
Type Alias HashToIndex_T Undocumented
Type Alias IndexToLines_T Undocumented
Type Alias LinesChunkLimits_T Undocumented
Type Alias STREAM_TYPES Undocumented
Variable Index Undocumented
Variable LineNumber Undocumented
def filter_noncode_lines(ls_1: LineSet, stindex_1: Index, ls_2: LineSet, stindex_2: Index, common_lines_nb: int) -> int: (source)

Return the effective number of common lines between lineset1 and lineset2 filtered from non code lines.

That is to say the number of common successive stripped lines except those that do not contain code (for example a line with only an ending parenthesis).

:param ls_1: first lineset
:param stindex_1: first lineset starting index
:param ls_2: second lineset
:param stindex_2: second lineset starting index
:param common_lines_nb: number of common successive stripped lines before being filtered from non code lines
:return: the number of common successive stripped lines that contain code
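As a rough illustration of this filtering, the sketch below counts only the matching lines that contain at least one word character. It is a simplification that assumes plain lists of stripped strings rather than LineSet objects; `count_code_lines` and `LINES_WITH_CONTENT` are hypothetical names, and the pattern is copied from the REGEX_FOR_LINES_WITH_CONTENT value shown on this page.

```python
import re

# Same pattern as the REGEX_FOR_LINES_WITH_CONTENT value documented here.
LINES_WITH_CONTENT = re.compile(r".*\w+")


def count_code_lines(stripped_1, start_1, stripped_2, start_2, common_lines_nb):
    """Count common lines in two chunks, ignoring lines without any word character."""
    chunk_1 = stripped_1[start_1:start_1 + common_lines_nb]
    chunk_2 = stripped_2[start_2:start_2 + common_lines_nb]
    return sum(
        1
        for line_1, line_2 in zip(chunk_1, chunk_2)
        if line_1 == line_2 and LINES_WITH_CONTENT.match(line_1)
    )


call = ["foo(", "a,", "b,", ")"]
print(count_code_lines(call, 0, call, 0, 4))  # 3, the lone ")" is not counted
```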

def hash_lineset(lineset: LineSet, min_common_lines: int = DEFAULT_MIN_SIMILARITY_LINE) -> tuple[HashToIndex_T, IndexToLines_T]: (source)

Return two dicts.

The first dict associates the hash of successive stripped lines of a lineset with the indices of the starting lines. The second dict associates the index of the starting line in the lineset's stripped lines with the couple [start, end] of line numbers in the corresponding file.

:param lineset: lineset object (i.e. the lines in a file)
:param min_common_lines: number of successive lines that are used to compute the hash
:return: a dict linking hashes to the corresponding start indices and a dict that links each index to the start and end lines in the file
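The shape of the two returned dicts may be easier to see in a small sketch. It assumes the stripped lines are given as (line number, text) tuples and that the end line number is inclusive; both are assumptions of this example, and `hash_windows` is a hypothetical stand-in that uses plain tuples instead of the LinesChunk and SuccessiveLinesLimits classes.

```python
from collections import defaultdict


def hash_windows(stripped, min_common_lines=4):
    """Build hash -> start indices and start index -> (first line, last line)."""
    hash_to_index = defaultdict(list)
    index_to_lines = {}
    for index in range(len(stripped) - min_common_lines + 1):
        window = stripped[index:index + min_common_lines]
        key = hash(tuple(text for _, text in window))
        hash_to_index[key].append(index)
        # Original file line numbers of the first and last line in the window.
        index_to_lines[index] = (window[0][0], window[-1][0])
    return dict(hash_to_index), index_to_lines


# Lines 2, 5 and 6 were removed by stripping, so line numbers have gaps.
stripped = [(1, "a = 1"), (3, "b = 2"), (4, "c = 3"), (7, "d = 4"), (8, "e = 5")]
hashes, limits = hash_windows(stripped, min_common_lines=4)
print(limits)  # {0: (1, 7), 1: (3, 8)}
```

Keeping the index-to-lines mapping separate from the hashes lets the matching work on stripped indices while the final report still refers to line numbers in the original file.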

def register(linter: PyLinter): (source)

Undocumented

def remove_successive(all_couples: CplIndexToCplLines_T): (source)

Removes all successive entries in the dictionary in argument.

:param all_couples: collection that has to be cleaned up from successive entries. The keys are couples of indices that mark the beginning of common entries in both linesets. The values have two parts. The first one is the couple of starting and ending line numbers of common successive lines in the first file. The second part is the same for the second file.

For example consider the following dict:

>>> all_couples
{(11, 34): ([5, 9], [27, 31]),
 (23, 79): ([15, 19], [45, 49]),
 (12, 35): ([6, 10], [28, 32])}

There are two successive keys, (11, 34) and (12, 35). It means there are two consecutive similar chunks of lines in both files. Thus the last entry is removed and the ending line numbers in the first entry are updated:

>>> remove_successive(all_couples)
>>> all_couples
{(11, 34): ([5, 10], [27, 32]),
 (23, 79): ([15, 19], [45, 49])}
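The merge described in this docstring can be reproduced with a short sketch on plain tuples and lists. `merge_successive` is a hypothetical name for this illustration and it does not use the LineSetStartCouple or CplSuccessiveLinesLimits classes.

```python
def merge_successive(all_couples):
    """Merge entries whose keys are successive, modifying the dict in place."""
    for start in sorted(all_couples):
        if start not in all_couples:
            continue  # already merged into an earlier entry
        first, second = all_couples[start]
        successor = (start[0] + 1, start[1] + 1)
        while successor in all_couples:
            next_first, next_second = all_couples.pop(successor)
            # Extend the ending line numbers of the surviving entry.
            first[1] = next_first[1]
            second[1] = next_second[1]
            successor = (successor[0] + 1, successor[1] + 1)


all_couples = {
    (11, 34): ([5, 9], [27, 31]),
    (23, 79): ([15, 19], [45, 49]),
    (12, 35): ([6, 10], [28, 32]),
}
merge_successive(all_couples)
print(all_couples)
# {(11, 34): ([5, 10], [27, 32]), (23, 79): ([15, 19], [45, 49])}
```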

def report_similarities(sect: Section, stats: LinterStats, old_stats: LinterStats|None): (source)

Make a layout with some stats about duplication.

Standalone command line access point.

def stripped_lines(lines: Iterable[str], ignore_comments: bool, ignore_docstrings: bool, ignore_imports: bool, ignore_signatures: bool, line_enabled_callback: Callable[[str, int], bool]|None = None) -> list[LineSpecifs]: (source)

Return tuples of line/line number/line type with leading/trailing white-space and any ignored code features removed.

:param lines: a collection of lines
:param ignore_comments: if true, any comment in the lines collection is removed from the result
:param ignore_docstrings: if true, any line that is a docstring is removed from the result
:param ignore_imports: if true, any line that is an import is removed from the result
:param ignore_signatures: if true, any line that is part of a function signature is removed from the result
:param line_enabled_callback: If called with "R0801" and a line number, a return value of False will disregard the line
:return: the collection of line/line number/line type tuples
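To give a feel for what stripping does, here is a deliberately naive sketch that handles only comments and imports with plain string tests. It is not how this function works internally: docstrings, signatures and the callback are left out, line numbers are assumed to start at 1, and `strip_lines` is a hypothetical name.

```python
def strip_lines(lines, ignore_comments=True, ignore_imports=True):
    """Return (line number, text) tuples for lines that still have content."""
    result = []
    for line_number, line in enumerate(lines, start=1):
        text = line.strip()
        if ignore_imports and text.startswith(("import ", "from ")):
            text = ""
        if ignore_comments:
            # Naive: this also cuts at a '#' that appears inside a string literal.
            text = text.split("#", 1)[0].strip()
        if text:
            result.append((line_number, text))
    return result


source = ["import os", "  x = 1  # set x", "", "# note", "y = 2"]
print(strip_lines(source))  # [(2, 'x = 1'), (5, 'y = 2')]
```

Keeping the original line number next to each stripped line is what allows later steps to report positions in the unmodified file.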

def usage(status: int = 0) -> NoReturn: (source)

Display command line usage information.

DEFAULT_MIN_SIMILARITY_LINE: int = (source)

Undocumented

Value
4

MSGS = (source)

Undocumented

Value
{'R0801': ('''Similar lines in %s files
%s''',
           'duplicate-code',
           'Indicates that a set of similar lines has been detected among multiple file. This usually means that the code should be refactored to avoid this duplication.')}
REGEX_FOR_LINES_WITH_CONTENT = (source)

Undocumented

Value
re.compile(r'.*\w+')
CplIndexToCplLines_T = (source)

Undocumented

Value
Dict['LineSetStartCouple', CplSuccessiveLinesLimits]
HashToIndex_T = (source)

Undocumented

Value
Dict['LinesChunk', List[Index]]
IndexToLines_T = (source)

Undocumented

Value
Dict[Index, 'SuccessiveLinesLimits']
LinesChunkLimits_T = (source)

Undocumented

Value
Tuple['LineSet', LineNumber, LineNumber]
STREAM_TYPES = (source)

Undocumented

Value
Union[TextIO, BufferedReader, BytesIO]

Index = (source)

Undocumented

LineNumber = (source)

Undocumented