Merge pull request #562 from aliparlakci/development

Serene-Arc · web-flow · commit 8718295ee51b · 2021-11-24T13:17:04.000+10:00
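
The `--ignore-user` option added in this merge skips any submission whose author name appears in the ignore list. Below is a minimal sketch of that check on stand-in objects, assuming a deleted account shows up as a missing author. The `Author` and `Post` classes and the `is_ignored` helper are hypothetical illustrations, not part of the BDFR or PRAW.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Author:
    # Hypothetical stand-in for the author object attached to a submission
    name: str


@dataclass
class Post:
    # Hypothetical stand-in for a submission
    id: str
    author: Optional[Author]


def is_ignored(post: Post, ignore_user: list[str]) -> bool:
    """Return True when the post's author name is in the ignore list."""
    # Assumption: a post without an author is never treated as ignored
    if post.author is None:
        return False
    # Exact, case-sensitive membership test, as in the diff
    return post.author.name in ignore_user
```

Because the comparison is an exact string match, a name passed with different capitalisation would not be skipped in this sketch.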
diff --git a/README.md b/README.md
@@ -7,6 +7,8 @@ This is a tool to download submissions or submission data from Reddit. It can be
 
 If you wish to open an issue, please read [the guide on opening issues](docs/CONTRIBUTING.md#opening-an-issue) to ensure that your issue is clear and contains everything it needs to for the developers to investigate.
 
+This README includes a few example Bash tricks for getting certain behaviour; see [Common Command Tricks](#common-command-tricks).
+
 ## Installation
 *Bulk Downloader for Reddit* needs Python version 3.9 or above. Please update Python before installation to meet the requirement. Then, you can install it as such:
 ```bash
@@ -76,6 +78,9 @@ The following options are common between both the `archive` and `download` comma
   - Can be specified multiple times
   - Disables certain modules from being used
   - See [Disabling Modules](#disabling-modules) for more information and a list of module names
+- `--ignore-user`
+  - This will skip any submission made by the given user
+  - Can be specified multiple times
 - `--include-id-file`
   - This will add any submission with the IDs in the files provided
   - Can be specified multiple times
@@ -208,6 +213,16 @@ The following options are for the `archive` command specifically.
 
 The `clone` command can take all the options listed above for both the `archive` and `download` commands since it performs the functions of both.
 
+## Common Command Tricks
+
+A common use case is loading subreddits or users from a file. The BDFR doesn't support this directly, but it is simple enough to do through the command line. Consider a list of usernames to download; they can be passed through to the BDFR with the following command, assuming that the usernames are in a text file, one per line:
+
+```bash
+cat users.txt | xargs -L 1 echo --user | xargs -L 50 python3 -m bdfr download <ARGS>
+```
+
+The `-L 50` part passes at most 50 input lines to each invocation so that the character limit for a single command line isn't exceeded, though it may not be necessary. The same approach can load subreddits from a file: exchange `--user` with `--subreddit`, and so on.
+
 ## Authentication and Security
 
 The BDFR uses OAuth2 authentication to connect to Reddit if authentication is required. This means that it is a secure, token-based system for making requests. This also means that the BDFR only has access to specific parts of the account authenticated, by default only saved posts, upvoted posts, and the identity of the authenticated account. Note that authentication is not required unless accessing private things like upvoted posts, saved posts, and private multireddits.
@@ -320,10 +335,14 @@ The BDFR can be run in multiple instances with multiple configurations, either c
 
 Running these scenarios consecutively is done easily, like any single run. Configuration files that differ may be specified with the `--config` option to switch between tokens, for example. Otherwise, almost all configuration for data sources can be specified per-run through the command line.
 
-Running scenarious concurrently (at the same time) however, is more complicated. The BDFR will look to a single, static place to put the detailed log files, in a directory with the configuration file specified above. If there are multiple instances, or processes, of the BDFR running at the same time, they will all be trying to write to a single file. On Linux and other UNIX based operating systems, this will succeed, though there is a substantial risk that the logfile will be useless due to garbled and jumbled data. On Windows however, attempting this will raise an error that crashes the program as Windows forbids multiple processes from accessing the same file.
+Running scenarios concurrently (at the same time), however, is more complicated. The BDFR will look to a single, static place to put the detailed log files, in a directory with the configuration file specified above. If there are multiple instances, or processes, of the BDFR running at the same time, they will all be trying to write to a single file. On Linux and other UNIX-based operating systems, this will succeed, though there is a substantial risk that the logfile will be useless due to garbled and jumbled data. On Windows, however, attempting this will raise an error that crashes the program, as Windows forbids multiple processes from accessing the same file.
 
 The way to fix this is to use the `--log` option to manually specify where the logfile is to be stored. If the given location is unique to each instance of the BDFR, then it will run fine.
 
+## Manipulating Logfiles
+
+The logfiles that the BDFR outputs are consistent, quite detailed, and in a format that is amenable to regex. To this end, a number of Bash scripts have been [included here](./scripts). They show examples of how to extract successfully downloaded IDs, failed IDs, and more besides.
+
 ## List of currently supported sources
 
   - Direct links (links leading to a file)
diff --git a/bdfr/__main__.py b/bdfr/__main__.py
@@ -17,6 +17,7 @@
     click.option('--authenticate', is_flag=True, default=None),
     click.option('--config', type=str, default=None),
     click.option('--disable-module', multiple=True, default=None, type=str),
+    click.option('--ignore-user', type=str, multiple=True, default=None),
     click.option('--include-id-file', multiple=True, default=None),
     click.option('--log', type=str, default=None),
     click.option('--saved', is_flag=True, default=None),
diff --git a/bdfr/archiver.py b/bdfr/archiver.py
@@ -28,6 +28,11 @@ def __init__(self, args: Configuration):
     def download(self):
         for generator in self.reddit_lists:
             for submission in generator:
+                if submission.author is not None and submission.author.name in self.args.ignore_user:
+                    logger.debug(
+                        f'Submission {submission.id} in {submission.subreddit.display_name} skipped'
+                        f' due to {submission.author.name} being an ignored user')
+                    continue
                 logger.debug(f'Attempting to archive submission {submission.id}')
                 self.write_entry(submission)
 
diff --git a/bdfr/configuration.py b/bdfr/configuration.py
@@ -18,6 +18,7 @@ def __init__(self):
         self.exclude_id_file = []
         self.file_scheme: str = '{REDDITOR}_{TITLE}_{POSTID}'
         self.folder_scheme: str = '{SUBREDDIT}'
+        self.ignore_user = []
         self.include_id_file = []
         self.limit: Optional[int] = None
         self.link: list[str] = []
diff --git a/bdfr/downloader.py b/bdfr/downloader.py
@@ -51,6 +51,11 @@ def _download_submission(self, submission: praw.models.Submission):
         elif submission.subreddit.display_name.lower() in self.args.skip_subreddit:
             logger.debug(f'Submission {submission.id} in {submission.subreddit.display_name} in skip list')
             return
+        elif submission.author is not None and submission.author.name in self.args.ignore_user:
+            logger.debug(
+                f'Submission {submission.id} in {submission.subreddit.display_name} skipped'
+                f' due to {submission.author.name} being an ignored user')
+            return
         elif not isinstance(submission, praw.models.Submission):
             logger.warning(f'{submission.id} is not a submission')
             return
diff --git a/bdfr/file_name_formatter.py b/bdfr/file_name_formatter.py
@@ -110,31 +110,39 @@ def format_path(
         index = f'_{str(index)}' if index else ''
         if not resource.extension:
             raise BulkDownloaderException(f'Resource from {resource.url} has no extension')
-        ending = index + resource.extension
         file_name = str(self._format_name(resource.source_submission, self.file_format_string))
+        if not re.match(r'.*\.$', file_name) and not re.match(r'^\..*', resource.extension):
+            ending = index + '.' + resource.extension
+        else:
+            ending = index + resource.extension
 
         try:
-            file_path = self._limit_file_name_length(file_name, ending, subfolder)
+            file_path = self.limit_file_name_length(file_name, ending, subfolder)
         except TypeError:
             raise BulkDownloaderException(f'Could not determine path name: {subfolder}, {index}, {resource.extension}')
         return file_path
 
     @staticmethod
-    def _limit_file_name_length(filename: str, ending: str, root: Path) -> Path:
+    def limit_file_name_length(filename: str, ending: str, root: Path) -> Path:
         root = root.resolve().expanduser()
         possible_id = re.search(r'((?:_\w{6})?$)', filename)
         if possible_id:
             ending = possible_id.group(1) + ending
             filename = filename[:possible_id.start()]
         max_path = FileNameFormatter.find_max_path_length()
-        max_length_chars = 255 - len(ending)
-        max_length_bytes = 255 - len(ending.encode('utf-8'))
+        max_file_part_length_chars = 255 - len(ending)
+        max_file_part_length_bytes = 255 - len(ending.encode('utf-8'))
         max_path_length = max_path - len(ending) - len(str(root)) - 1
-        while len(filename) > max_length_chars or \
-                len(filename.encode('utf-8')) > max_length_bytes or \
-                len(filename) > max_path_length:
+
+        out = Path(root, filename + ending)
+        while any([len(filename) > max_file_part_length_chars,
+                   len(filename.encode('utf-8')) > max_file_part_length_bytes,
+                   len(str(out)) > max_path_length,
+                   ]):
             filename = filename[:-1]
-        return Path(root, filename + ending)
+            out = Path(root, filename + ending)
+
+        return out
 
     @staticmethod
     def find_max_path_length() -> int:
diff --git a/bdfr/site_downloaders/download_factory.py b/bdfr/site_downloaders/download_factory.py
@@ -9,7 +9,7 @@
 from bdfr.site_downloaders.base_downloader import BaseDownloader
 from bdfr.site_downloaders.direct import Direct
 from bdfr.site_downloaders.erome import Erome
-from bdfr.site_downloaders.fallback_downloaders.youtubedl_fallback import YoutubeDlFallback
+from bdfr.site_downloaders.fallback_downloaders.ytdlp_fallback import YtdlpFallback
 from bdfr.site_downloaders.gallery import Gallery
 from bdfr.site_downloaders.gfycat import Gfycat
 from bdfr.site_downloaders.imgur import Imgur
@@ -24,7 +24,7 @@ class DownloadFactory:
     @staticmethod
     def pull_lever(url: str) -> Type[BaseDownloader]:
         sanitised_url = DownloadFactory.sanitise_url(url)
-        if re.match(r'(i\.)?imgur.*\.gifv$', sanitised_url):
+        if re.match(r'(i\.)?imgur.*\.gif.+$', sanitised_url):
             return Imgur
         elif re.match(r'.*/.*\.\w{3,4}(\?[\w;&=]*)?$', sanitised_url) and \
                 not DownloadFactory.is_web_resource(sanitised_url):
@@ -49,8 +49,8 @@ def pull_lever(url: str) -> Type[BaseDownloader]:
             return PornHub
         elif re.match(r'vidble\.com', sanitised_url):
             return Vidble
-        elif YoutubeDlFallback.can_handle_link(sanitised_url):
-            return YoutubeDlFallback
+        elif YtdlpFallback.can_handle_link(sanitised_url):
+            return YtdlpFallback
         else:
             raise NotADownloadableLinkError(f'No downloader module exists for url {url}')
 
diff --git a/bdfr/site_downloaders/fallback_downloaders/ytdlp_fallback.py b/bdfr/site_downloaders/fallback_downloaders/ytdlp_fallback.py
@@ -6,6 +6,7 @@
 
 from praw.models import Submission
 
+from bdfr.exceptions import NotADownloadableLinkError
 from bdfr.resource import Resource
 from bdfr.site_authenticator import SiteAuthenticator
 from bdfr.site_downloaders.fallback_downloaders.fallback_downloader import BaseFallbackDownloader
@@ -14,9 +15,9 @@
 logger = logging.getLogger(__name__)
 
 
-class YoutubeDlFallback(BaseFallbackDownloader, Youtube):
+class YtdlpFallback(BaseFallbackDownloader, Youtube):
     def __init__(self, post: Submission):
-        super(YoutubeDlFallback, self).__init__(post)
+        super(YtdlpFallback, self).__init__(post)
 
     def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> list[Resource]:
         out = Resource(
@@ -29,8 +30,9 @@ def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> l
 
     @staticmethod
     def can_handle_link(url: str) -> bool:
-        attributes = YoutubeDlFallback.get_video_attributes(url)
+        try:
+            attributes = YtdlpFallback.get_video_attributes(url)
+        except NotADownloadableLinkError:
+            return False
         if attributes:
             return True
-        else:
-            return False
diff --git a/bdfr/site_downloaders/imgur.py b/bdfr/site_downloaders/imgur.py
@@ -42,9 +42,9 @@ def _compute_image_url(self, image: dict) -> Resource:
     @staticmethod
     def _get_data(link: str) -> dict:
         link = link.rstrip('?')
-        if re.match(r'(?i).*\.gifv$', link):
+        if re.match(r'(?i).*\.gif.+$', link):
             link = link.replace('i.imgur', 'imgur')
-            link = re.sub('(?i)\\.gifv$', '', link)
+            link = re.sub('(?i)\\.gif.+$', '', link)
 
         res = Imgur.retrieve_url(link, cookies={'over18': '1', 'postpagebeta': '0'})
 
diff --git a/bdfr/site_downloaders/pornhub.py b/bdfr/site_downloaders/pornhub.py
@@ -6,6 +6,7 @@
 
 from praw.models import Submission
 
+from bdfr.exceptions import SiteDownloaderError
 from bdfr.resource import Resource
 from bdfr.site_authenticator import SiteAuthenticator
 from bdfr.site_downloaders.youtube import Youtube
@@ -22,10 +23,15 @@ def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> l
             'format': 'best',
             'nooverwrites': True,
         }
+        if video_attributes := super().get_video_attributes(self.post.url):
+            extension = video_attributes['ext']
+        else:
+            raise SiteDownloaderError(f'Could not determine video attributes for {self.post.url}')
+
         out = Resource(
             self.post,
             self.post.url,
             super()._download_video(ytdl_options),
-            super().get_video_attributes(self.post.url)['ext'],
+            extension,
         )
         return [out]
diff --git a/bdfr/site_downloaders/youtube.py b/bdfr/site_downloaders/youtube.py
@@ -27,10 +27,7 @@ def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> l
             'nooverwrites': True,
         }
         download_function = self._download_video(ytdl_options)
-        try:
-            extension = self.get_video_attributes(self.post.url)['ext']
-        except KeyError:
-            raise NotADownloadableLinkError(f'Youtube-DL cannot download URL {self.post.url}')
+        extension = self.get_video_attributes(self.post.url)['ext']
         res = Resource(self.post, self.post.url, download_function, extension)
         return [res]
 
@@ -67,6 +64,10 @@ def get_video_attributes(url: str) -> dict:
         with yt_dlp.YoutubeDL({'logger': yt_logger, }) as ydl:
             try:
                 result = ydl.extract_info(url, download=False)
-                return result
             except Exception as e:
                 logger.exception(e)
+                raise NotADownloadableLinkError(f'Video info extraction failed for {url}')
+        if 'ext' in result:
+            return result
+        else:
+            raise NotADownloadableLinkError(f'Video info extraction failed for {url}')
diff --git a/setup.cfg b/setup.cfg
@@ -4,7 +4,7 @@ description_file = README.md
 description_content_type = text/markdown
 home_page = https://github.com/aliparlakci/bulk-downloader-for-reddit
 keywords = reddit, download, archive
-version = 2.4.2
+version = 2.5.0
 author = Ali Parlakci
 author_email = parlakciali@gmail.com
 maintainer = Serene Arc
diff --git a/tests/integration_tests/test_archive_integration.py b/tests/integration_tests/test_archive_integration.py
@@ -106,3 +106,18 @@ def test_cli_archive_long(test_args: list[str], tmp_path: Path):
     result = runner.invoke(cli, test_args)
     assert result.exit_code == 0
     assert re.search(r'Writing entry .*? to file in .*? format', result.output)
+
+
+@pytest.mark.online
+@pytest.mark.reddit
+@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
+@pytest.mark.parametrize('test_args', (
+    ['--ignore-user', 'ArjanEgges', '-l', 'm3hxzd'],
+))
+def test_cli_archive_ignore_user(test_args: list[str], tmp_path: Path):
+    runner = CliRunner()
+    test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
+    result = runner.invoke(cli, test_args)
+    assert result.exit_code == 0
+    assert 'being an ignored user' in result.output
+    assert 'Attempting to archive submission' not in result.output
diff --git a/tests/integration_tests/test_download_integration.py b/tests/integration_tests/test_download_integration.py
@@ -337,3 +337,18 @@ def test_cli_download_include_id_file(tmp_path: Path):
     result = runner.invoke(cli, test_args)
     assert result.exit_code == 0
     assert 'Downloaded submission' in result.output
+
+
+@pytest.mark.online
+@pytest.mark.reddit
+@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
+@pytest.mark.parametrize('test_args', (
+    ['--ignore-user', 'ArjanEgges', '-l', 'm3hxzd'],
+))
+def test_cli_download_ignore_user(test_args: list[str], tmp_path: Path):
+    runner = CliRunner()
+    test_args = create_basic_args_for_download_runner(test_args, tmp_path)
+    result = runner.invoke(cli, test_args)
+    assert result.exit_code == 0
+    assert 'Downloaded submission' not in result.output
+    assert 'being an ignored user' in result.output
diff --git a/tests/site_downloaders/fallback_downloaders/test_ytdlp_fallback.py b/tests/site_downloaders/fallback_downloaders/test_ytdlp_fallback.py
@@ -4,21 +4,32 @@
 
 import pytest
 
+from bdfr.exceptions import NotADownloadableLinkError
 from bdfr.resource import Resource
-from bdfr.site_downloaders.fallback_downloaders.youtubedl_fallback import YoutubeDlFallback
+from bdfr.site_downloaders.fallback_downloaders.ytdlp_fallback import YtdlpFallback
 
 
 @pytest.mark.online
 @pytest.mark.parametrize(('test_url', 'expected'), (
     ('https://www.reddit.com/r/specializedtools/comments/n2nw5m/bamboo_splitter/', True),
     ('https://www.youtube.com/watch?v=P19nvJOmqCc', True),
     ('https://www.example.com/test', False),
+    ('https://milesmatrix.bandcamp.com/album/la-boum/', False),
 ))
 def test_can_handle_link(test_url: str, expected: bool):
-    result = YoutubeDlFallback.can_handle_link(test_url)
+    result = YtdlpFallback.can_handle_link(test_url)
     assert result == expected
 
 
+@pytest.mark.online
+@pytest.mark.parametrize('test_url', (
+    'https://milesmatrix.bandcamp.com/album/la-boum/',
+))
+def test_info_extraction_bad(test_url: str):
+    with pytest.raises(NotADownloadableLinkError):
+        YtdlpFallback.get_video_attributes(test_url)
+
+
 @pytest.mark.online
 @pytest.mark.slow
 @pytest.mark.parametrize(('test_url', 'expected_hash'), (
@@ -30,7 +41,7 @@ def test_can_handle_link(test_url: str, expected: bool):
 def test_find_resources(test_url: str, expected_hash: str):
     test_submission = MagicMock()
     test_submission.url = test_url
-    downloader = YoutubeDlFallback(test_submission)
+    downloader = YtdlpFallback(test_submission)
     resources = downloader.find_resources()
     assert len(resources) == 1
     assert isinstance(resources[0], Resource)
diff --git a/tests/site_downloaders/test_download_factory.py b/tests/site_downloaders/test_download_factory.py
@@ -9,7 +9,7 @@
 from bdfr.site_downloaders.direct import Direct
 from bdfr.site_downloaders.download_factory import DownloadFactory
 from bdfr.site_downloaders.erome import Erome
-from bdfr.site_downloaders.fallback_downloaders.youtubedl_fallback import YoutubeDlFallback
+from bdfr.site_downloaders.fallback_downloaders.ytdlp_fallback import YtdlpFallback
 from bdfr.site_downloaders.gallery import Gallery
 from bdfr.site_downloaders.gfycat import Gfycat
 from bdfr.site_downloaders.imgur import Imgur
@@ -30,6 +30,7 @@
     ('https://imgur.com/BuzvZwb.gifv', Imgur),
     ('https://i.imgur.com/6fNdLst.gif', Direct),
     ('https://imgur.com/a/MkxAzeg', Imgur),
+    ('https://i.imgur.com/OGeVuAe.giff', Imgur),
     ('https://www.reddit.com/gallery/lu93m7', Gallery),
     ('https://gfycat.com/concretecheerfulfinwhale', Gfycat),
     ('https://www.erome.com/a/NWGw0F09', Erome),
@@ -41,10 +42,10 @@
     ('https://i.imgur.com/3SKrQfK.jpg?1', Direct),
     ('https://dynasty-scans.com/system/images_images/000/017/819/original/80215103_p0.png?1612232781', Direct),
     ('https://m.imgur.com/a/py3RW0j', Imgur),
-    ('https://v.redd.it/9z1dnk3xr5k61', YoutubeDlFallback),
-    ('https://streamable.com/dt46y', YoutubeDlFallback),
-    ('https://vimeo.com/channels/31259/53576664', YoutubeDlFallback),
-    ('http://video.pbs.org/viralplayer/2365173446/', YoutubeDlFallback),
+    ('https://v.redd.it/9z1dnk3xr5k61', YtdlpFallback),
+    ('https://streamable.com/dt46y', YtdlpFallback),
+    ('https://vimeo.com/channels/31259/53576664', YtdlpFallback),
+    ('http://video.pbs.org/viralplayer/2365173446/', YtdlpFallback),
     ('https://www.pornhub.com/view_video.php?viewkey=ph5a2ee0461a8d0', PornHub),
 ))
 def test_factory_lever_good(test_submission_url: str, expected_class: BaseDownloader, reddit_instance: praw.Reddit):
diff --git a/tests/site_downloaders/test_imgur.py b/tests/site_downloaders/test_imgur.py
diff --git a/tests/site_downloaders/test_pornhub.py b/tests/site_downloaders/test_pornhub.py
diff --git a/tests/test_file_name_formatter.py b/tests/test_file_name_formatter.py
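
A note on the `format_path` change in `bdfr/file_name_formatter.py`: the ending gains a dot only when the file name does not already end with one and the extension does not already start with one. The following is a minimal sketch of that rule using plain string methods instead of the regular expressions in the diff. `build_ending` is a hypothetical helper name, not a BDFR function.

```python
def build_ending(file_name: str, index: str, extension: str) -> str:
    """Join the optional index and the extension, adding a dot when needed."""
    # Add the separator only if neither side already supplies one
    if not file_name.endswith('.') and not extension.startswith('.'):
        return index + '.' + extension
    return index + extension


# Usage: the extension may arrive with or without its leading dot
print('title' + build_ending('title', '', 'png'))
print('title' + build_ending('title', '_1', '.png'))
```

Both calls produce a name with exactly one dot before `png`, which is the behaviour the new branch in `format_path` appears intended to guarantee.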