pylint.checkers.similar

module documentation


A similarities / code duplication command line tool and pylint checker.

The algorithm is based on comparing the hash values of n successive lines of a file. First the files are read and any lines that do not fulfill the requirements (comments, docstrings...) are removed. The stripped lines are stored in the LineSet class, which gives access to them. Then each index of the stripped lines collection is associated with the hash of n successive entries of the stripped lines starting at that index (n is the minimum common lines option).

The hashes common to both linesets are then looked up. If there are matches, the match indices in both linesets are stored and associated with the corresponding couples (start line number/end line number) in both files.

This association is then post-processed to handle successive matches. For example, if the minimum common lines setting is four, the hashes are computed over four lines. If one couple of match indices (12, 34) is the successor of another one (11, 33), then there are in fact five common lines.

Once post-processed, the values of the association table are the result sought, i.e. the start and end line numbers of the common lines in both files.
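The matching step described above can be illustrated with a minimal sketch. It works on plain lists of already stripped strings rather than on LineSet objects, and the names `window_hashes` and `common_starts` are hypothetical helpers invented for this illustration, not part of the module.

```python
from collections import defaultdict


def window_hashes(lines, n):
    """Map the hash of each run of n successive lines to its start indices."""
    table = defaultdict(list)
    for start in range(len(lines) - n + 1):
        table[hash(tuple(lines[start:start + n]))].append(start)
    return table


def common_starts(lines_a, lines_b, n=4):
    """Return sorted (index_a, index_b) pairs where n successive lines match."""
    hashes_a = window_hashes(lines_a, n)
    hashes_b = window_hashes(lines_b, n)
    pairs = []
    for key in hashes_a.keys() & hashes_b.keys():
        for i in hashes_a[key]:
            for j in hashes_b[key]:
                # Guard against hash collisions by comparing the real lines.
                if lines_a[i:i + n] == lines_b[j:j + n]:
                    pairs.append((i, j))
    return sorted(pairs)


file_a = ["a = 1", "b = 2", "c = 3", "d = 4", "e = 5", "x = 0"]
file_b = ["y = 9", "a = 1", "b = 2", "c = 3", "d = 4", "e = 5"]
print(common_starts(file_a, file_b, n=4))  # [(0, 1), (1, 2)]
```

The two pairs are successive, (1, 2) following (0, 1), so they describe a single run of five common lines, as in the (11, 33) and (12, 34) example in the module description.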

Class	`Commonality`	Undocumented
Class	`CplSuccessiveLinesLimits`	Holds a SuccessiveLinesLimits object for each checked file and counts the number of common lines between both stripped lines collections extracted from both files.
Class	`LinesChunk`	The LinesChunk object computes and stores the hash of some consecutive stripped lines of a lineset.
Class	`LineSet`	Holds and indexes all the lines of a single source file.
Class	`LineSetStartCouple`	Indices in both linesets that mark the beginning of successive lines.
Class	`LineSpecifs`	Undocumented
Class	`Similar`	Finds copy-pasted lines of code in a project.
Class	`SimilarChecker`	Checks for similarities and duplicated code.
Class	`SuccessiveLinesLimits`	A class to handle the numbering of begin and end of successive lines.
Function	`filter_noncode_lines`	Return the effective number of common lines between lineset1 and lineset2 filtered from non code lines.
Function	`hash_lineset`	Return two dicts.
Function	`register`	Undocumented
Function	`remove_successive`	Removes all successive entries in the dictionary in argument.
Function	`report_similarities`	Make a layout with some stats about duplication.
Function	`Run`	Standalone command line access point.
Function	`stripped_lines`	Return tuples of line/line number/line type with leading/trailing white-space and any ignored code features removed.
Function	`usage`	Display command line usage information.
Constant	`DEFAULT_MIN_SIMILARITY_LINE`	Undocumented
Constant	`MSGS`	Undocumented
Constant	`REGEX_FOR_LINES_WITH_CONTENT`	Undocumented
Type Alias	`CplIndexToCplLines_T`	Undocumented
Type Alias	`HashToIndex_T`	Undocumented
Type Alias	`IndexToLines_T`	Undocumented
Type Alias	`LinesChunkLimits_T`	Undocumented
Type Alias	`STREAM_TYPES`	Undocumented
Variable	`Index`	Undocumented
Variable	`LineNumber`	Undocumented

def filter_noncode_lines(ls_1: LineSet, stindex_1: Index, ls_2: LineSet, stindex_2: Index, common_lines_nb: int) -> int:

Return the effective number of common lines between lineset1 and lineset2 filtered from non code lines. That is to say, the number of common successive stripped lines except those that do not contain code (for example a line with only an ending parenthesis).

:param ls_1: first lineset
:param stindex_1: first lineset starting index
:param ls_2: second lineset
:param stindex_2: second lineset starting index
:param common_lines_nb: number of common successive stripped lines before being filtered from non code lines
:return: the number of common successive stripped lines that contain code
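As a rough illustration of this filtering, the sketch below counts matching lines in two chunks after dropping lines without any word character. It operates on plain string lists instead of LineSet objects. `count_code_lines` is a hypothetical name, and the pattern is copied from the REGEX_FOR_LINES_WITH_CONTENT value listed on this page.

```python
import re

# Same pattern as the REGEX_FOR_LINES_WITH_CONTENT value on this page.
HAS_CONTENT = re.compile(r".*\w+")


def count_code_lines(lines_1, start_1, lines_2, start_2, common_lines_nb):
    """Count common successive lines that contain at least one word character."""
    chunk_1 = [
        line for line in lines_1[start_1:start_1 + common_lines_nb]
        if HAS_CONTENT.match(line)
    ]
    chunk_2 = [
        line for line in lines_2[start_2:start_2 + common_lines_nb]
        if HAS_CONTENT.match(line)
    ]
    # Pair the surviving lines up and count the pairs that are identical.
    return sum(a == b for a, b in zip(chunk_1, chunk_2))


lines = ["result = compute(", "value,", ")", "return result"]
print(count_code_lines(lines, 0, lines, 0, 4))  # 3
```

The line holding only `)` is not counted, so four common stripped lines yield an effective count of three.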

def hash_lineset(lineset: LineSet, min_common_lines: int = DEFAULT_MIN_SIMILARITY_LINE) -> tuple[HashToIndex_T, IndexToLines_T]:

Return two dicts. The first associates the hash of successive stripped lines of a lineset with the indices of the starting lines. The second dict associates the index of the starting line in the lineset's stripped lines with the couple [start, end] of line numbers in the corresponding file.

:param lineset: lineset object (i.e. the lines in a file)
:param min_common_lines: number of successive lines that are used to compute the hash
:return: a dict linking hashes to the corresponding start indices and a dict that links each index to the start and end lines in the file
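The shape of the two returned dicts can be sketched as follows. This is a simplified stand-in that takes a list of `(line_number, text)` pairs instead of a LineSet, and `hash_windows` is a hypothetical name. The sketch records the file line number of the last line in each window as the end value; the end convention of the real function is not stated on this page and may differ.

```python
from collections import defaultdict


def hash_windows(stripped, n):
    """Build both dicts from (line_number, text) pairs left after stripping."""
    hash_to_indices = defaultdict(list)
    index_to_lines = {}
    for index in range(len(stripped) - n + 1):
        window = stripped[index:index + n]
        key = hash(tuple(text for _, text in window))
        hash_to_indices[key].append(index)
        # File line numbers of the first and last line of the window
        # (inclusive end: an assumption made for this sketch).
        index_to_lines[index] = (window[0][0], window[-1][0])
    return dict(hash_to_indices), index_to_lines


# Line numbers have gaps because lines 2, 5 and 6 were stripped out.
stripped = [(1, "a = 1"), (3, "b = 2"), (4, "c = 3"), (7, "d = 4")]
hashes, limits = hash_windows(stripped, n=2)
print(limits)  # {0: (1, 3), 1: (3, 4), 2: (4, 7)}
```

Keeping the second dict separate means a match found by index in the stripped collection can be reported with line numbers from the original file.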

def register(linter: PyLinter):

Undocumented

def remove_successive(all_couples: CplIndexToCplLines_T):

Removes all successive entries in the dictionary in argument.

:param all_couples: collection that has to be cleaned up from successive entries. The keys are couples of indices that mark the beginning of common entries in both linesets. The values have two parts. The first one is the couple of starting and ending line numbers of common successive lines in the first file. The second part is the same for the second file.

For example consider the following dict:

>>> all_couples
{(11, 34): ([5, 9], [27, 31]),
 (23, 79): ([15, 19], [45, 49]),
 (12, 35): ([6, 10], [28, 32])}

There are two successive keys (11, 34) and (12, 35). It means there are two consecutive similar chunks of lines in both files. Thus remove last entry and update the last line numbers in the first entry

>>> remove_successive(all_couples)
>>> all_couples
{(11, 34): ([5, 10], [27, 32]),
 (23, 79): ([15, 19], [45, 49])}
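The merge can be reproduced with plain tuples and lists, which is how the docstring example writes its data. The function below is a hypothetical re-implementation for illustration only: the real function works on LineSetStartCouple keys and CplSuccessiveLinesLimits values.

```python
def merge_successive(all_couples):
    """Fold entries with successive keys into the first entry, in place."""
    for key in sorted(all_couples):
        if key not in all_couples:
            continue  # already folded into an earlier entry
        successor = (key[0] + 1, key[1] + 1)
        while successor in all_couples:
            first, second = all_couples.pop(successor)
            # Extend the end line numbers of the surviving entry.
            all_couples[key][0][1] = first[1]
            all_couples[key][1][1] = second[1]
            successor = (successor[0] + 1, successor[1] + 1)


all_couples = {
    (11, 34): ([5, 9], [27, 31]),
    (23, 79): ([15, 19], [45, 49]),
    (12, 35): ([6, 10], [28, 32]),
}
merge_successive(all_couples)
print(all_couples)
# {(11, 34): ([5, 10], [27, 32]), (23, 79): ([15, 19], [45, 49])}
```

Iterating over a sorted copy of the keys makes it safe to pop entries from the dict inside the loop.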

def report_similarities(sect: Section, stats: LinterStats, old_stats: LinterStats|None):

Make a layout with some stats about duplication.

def Run(argv: Sequence[str]|None = None) -> NoReturn:

Standalone command line access point.

def stripped_lines(lines: Iterable[str], ignore_comments: bool, ignore_docstrings: bool, ignore_imports: bool, ignore_signatures: bool, line_enabled_callback: Callable[[str, int], bool]|None = None) -> list[LineSpecifs]:

Return tuples of line/line number/line type with leading/trailing white-space and any ignored code features removed.

:param lines: a collection of lines
:param ignore_comments: if true, any comment in the lines collection is removed from the result
:param ignore_docstrings: if true, any line that is a docstring is removed from the result
:param ignore_imports: if true, any line that is an import is removed from the result
:param ignore_signatures: if true, any line that is part of a function signature is removed from the result
:param line_enabled_callback: If called with "R0801" and a line number, a return value of False will disregard the line
:return: the collection of line/line number/line type tuples
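To give a feel for what stripping does, here is a deliberately naive sketch that handles only comments and single-line imports and returns `(line_number, text)` pairs. `strip_lines` is a hypothetical helper: it does not cover docstrings, signatures, multi-line imports, or the `line_enabled_callback`, and how the real function detects those features is not described on this page.

```python
def strip_lines(lines, ignore_comments=True, ignore_imports=True):
    """Return (line_number, text) pairs for the lines worth comparing."""
    kept = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if ignore_imports and text.startswith(("import ", "from ")):
            continue
        if ignore_comments:
            # Naive: cuts at the first '#', even inside a string literal.
            text = text.split("#", 1)[0].strip()
        if text:
            # Keep the original line number so matches can be reported later.
            kept.append((number, text))
    return kept


source = ["import os", "", "# setup", "x = 1  # counter", "    return x"]
print(strip_lines(source))  # [(4, 'x = 1'), (5, 'return x')]
```

Because indentation and comments are removed before hashing, two copies of a block that differ only in nesting level or commentary still compare as equal.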

def usage(status: int = 0) -> NoReturn:

Display command line usage information.

DEFAULT_MIN_SIMILARITY_LINE: int =

Undocumented

Value

MSGS: dict[str, MessageDefinitionTuple] =

Undocumented

Value

{'R0801': ('''Similar lines in %s files
%s''',
           'duplicate-code',
           'Indicates that a set of similar lines has been detected among multiple file. This usually means that the code should be refactored to avoid this duplication.')}

REGEX_FOR_LINES_WITH_CONTENT =

Undocumented

Value

re.compile(r'.*\w+')

CplIndexToCplLines_T =

Undocumented

Value

Dict['LineSetStartCouple', CplSuccessiveLinesLimits]

HashToIndex_T =

Undocumented

Value

Dict['LinesChunk', List[Index]]

IndexToLines_T =

Undocumented

Value

Dict[Index, 'SuccessiveLinesLimits']

LinesChunkLimits_T =

Undocumented

Value

Tuple['LineSet', LineNumber, LineNumber]

STREAM_TYPES =

Undocumented

Value

Union[TextIO, BufferedReader, BytesIO]

Index =

Undocumented

LineNumber =

Undocumented