
Commit 8718295

Merge pull request #562 from aliparlakci/development
2 parents 8104ce3 + cc80acd commit 8718295

19 files changed

Lines changed: 192 additions & 41 deletions

README.md

Lines changed: 20 additions & 1 deletion
@@ -7,6 +7,8 @@ This is a tool to download submissions or submission data from Reddit. It can be
 
 If you wish to open an issue, please read [the guide on opening issues](docs/CONTRIBUTING.md#opening-an-issue) to ensure that your issue is clear and contains everything it needs to for the developers to investigate.
 
+Included in this README are a few example Bash tricks to get certain behaviour. For that, see [Common Command Tricks](#common-command-tricks).
+
 ## Installation
 *Bulk Downloader for Reddit* needs Python version 3.9 or above. Please update Python before installation to meet the requirement. Then, you can install it as such:
 ```bash
@@ -76,6 +78,9 @@ The following options are common between both the `archive` and `download` comma
     - Can be specified multiple times
     - Disables certain modules from being used
     - See [Disabling Modules](#disabling-modules) for more information and a list of module names
+- `--ignore-user`
+    - This will add a user to ignore
+    - Can be specified multiple times
 - `--include-id-file`
     - This will add any submission with the IDs in the files provided
     - Can be specified multiple times
@@ -208,6 +213,16 @@ The following options are for the `archive` command specifically.
 
 The `clone` command can take all the options listed above for both the `archive` and `download` commands since it performs the functions of both.
 
+## Common Command Tricks
+
+A common use case is for subreddits/users to be loaded from a file. The BDFR doesn't support this directly but it is simple enough to do through the command-line. Consider a list of usernames to download; they can be passed through to the BDFR with the following command, assuming that the usernames are in a text file:
+
+```bash
+cat users.txt | xargs -L 1 echo --user | xargs -L 50 python3 -m bdfr download <ARGS>
+```
+
+The part `-L 50` is to make sure that the character limit for a single line isn't exceeded, but may not be necessary. This can also be used to load subreddits from a file, simply exchange `--user` with `--subreddit` and so on.
+
 ## Authentication and Security
 
 The BDFR uses OAuth2 authentication to connect to Reddit if authentication is required. This means that it is a secure, token-based system for making requests. This also means that the BDFR only has access to specific parts of the account authenticated, by default only saved posts, upvoted posts, and the identity of the authenticated account. Note that authentication is not required unless accessing private things like upvoted posts, saved posts, and private multireddits.
@@ -320,10 +335,14 @@ The BDFR can be run in multiple instances with multiple configurations, either c
 
 Running these scenarios consecutively is done easily, like any single run. Configuration files that differ may be specified with the `--config` option to switch between tokens, for example. Otherwise, almost all configuration for data sources can be specified per-run through the command line.
 
-Running scenarious concurrently (at the same time) however, is more complicated. The BDFR will look to a single, static place to put the detailed log files, in a directory with the configuration file specified above. If there are multiple instances, or processes, of the BDFR running at the same time, they will all be trying to write to a single file. On Linux and other UNIX based operating systems, this will succeed, though there is a substantial risk that the logfile will be useless due to garbled and jumbled data. On Windows however, attempting this will raise an error that crashes the program as Windows forbids multiple processes from accessing the same file.
+Running scenarios concurrently (at the same time) however, is more complicated. The BDFR will look to a single, static place to put the detailed log files, in a directory with the configuration file specified above. If there are multiple instances, or processes, of the BDFR running at the same time, they will all be trying to write to a single file. On Linux and other UNIX based operating systems, this will succeed, though there is a substantial risk that the logfile will be useless due to garbled and jumbled data. On Windows however, attempting this will raise an error that crashes the program as Windows forbids multiple processes from accessing the same file.
 
 The way to fix this is to use the `--log` option to manually specify where the logfile is to be stored. If the given location is unique to each instance of the BDFR, then it will run fine.
 
+## Manipulating Logfiles
+
+The logfiles that the BDFR outputs are consistent and quite detailed and in a format that is amenable to regex. To this end, a number of bash scripts have been [included here](./scripts). They show examples for how to extract successfully downloaded IDs, failed IDs, and more besides.
+
 ## List of currently supported sources
 
 - Direct links (links leading to a file)

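Review note on the Common Command Tricks hunk: the `xargs` pipeline batches usernames into groups of 50 and prefixes each with `--user`. The same batching can be approximated in Python where a shell is not available. This is a minimal sketch and not part of the commit; `build_commands`, `extra_args`, and the `./downloads` argument are hypothetical names standing in for the README's `<ARGS>` placeholder.

```python
def build_commands(usernames: list[str], extra_args: list[str], batch_size: int = 50) -> list[list[str]]:
    """Return one `python3 -m bdfr download` argument list per batch of usernames."""
    # Drop blank entries, as a file read line by line may contain empty lines
    users = [name.strip() for name in usernames if name.strip()]
    commands = []
    for start in range(0, len(users), batch_size):
        # extra_args is a hypothetical stand-in for the README's <ARGS> placeholder
        command = ['python3', '-m', 'bdfr', 'download', *extra_args]
        for user in users[start:start + batch_size]:
            command += ['--user', user]
        commands.append(command)
    return commands


if __name__ == '__main__':
    for command in build_commands(['alice', 'bob', 'carol'], ['./downloads'], batch_size=2):
        print(' '.join(command))
```

Each returned list could be handed to `subprocess.run`; swapping `--user` for `--subreddit` mirrors the substitution the README describes.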
bdfr/__main__.py

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@
     click.option('--authenticate', is_flag=True, default=None),
     click.option('--config', type=str, default=None),
     click.option('--disable-module', multiple=True, default=None, type=str),
+    click.option('--ignore-user', type=str, multiple=True, default=None),
     click.option('--include-id-file', multiple=True, default=None),
     click.option('--log', type=str, default=None),
     click.option('--saved', is_flag=True, default=None),

bdfr/archiver.py

Lines changed: 5 additions & 0 deletions
@@ -28,6 +28,11 @@ def __init__(self, args: Configuration):
     def download(self):
         for generator in self.reddit_lists:
             for submission in generator:
+                if submission.author.name in self.args.ignore_user:
+                    logger.debug(
+                        f'Submission {submission.id} in {submission.subreddit.display_name} skipped'
+                        f' due to {submission.author.name} being an ignored user')
+                    continue
                 logger.debug(f'Attempting to archive submission {submission.id}')
                 self.write_entry(submission)

bdfr/configuration.py

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ def __init__(self):
         self.exclude_id_file = []
         self.file_scheme: str = '{REDDITOR}_{TITLE}_{POSTID}'
         self.folder_scheme: str = '{SUBREDDIT}'
+        self.ignore_user = []
         self.include_id_file = []
         self.limit: Optional[int] = None
         self.link: list[str] = []

bdfr/downloader.py

Lines changed: 5 additions & 0 deletions
@@ -51,6 +51,11 @@ def _download_submission(self, submission: praw.models.Submission):
         elif submission.subreddit.display_name.lower() in self.args.skip_subreddit:
             logger.debug(f'Submission {submission.id} in {submission.subreddit.display_name} in skip list')
             return
+        elif submission.author.name in self.args.ignore_user:
+            logger.debug(
+                f'Submission {submission.id} in {submission.subreddit.display_name} skipped'
+                f' due to {submission.author.name} being an ignored user')
+            return
         elif not isinstance(submission, praw.models.Submission):
             logger.warning(f'{submission.id} is not a submission')
             return

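Review note on the `--ignore-user` checks in `archiver.py` and `downloader.py`: both read `submission.author.name` directly. PRAW is understood to report `author` as `None` when the posting account has been deleted, in which case that attribute access would raise `AttributeError`. The sketch below is not part of the commit and shows one possible guard; `should_skip` is a hypothetical helper, and `SimpleNamespace` stands in for a PRAW redditor object.

```python
from types import SimpleNamespace
from typing import Optional, Sequence


def should_skip(author: Optional[SimpleNamespace], ignored_users: Sequence[str]) -> bool:
    """Return True when the submission's author is on the ignore list."""
    # Assumption: a deleted account appears as author=None, so check before reading .name
    if author is None:
        return False
    # Exact, case-sensitive comparison, matching the check in the diff
    return author.name in ignored_users


if __name__ == '__main__':
    print(should_skip(SimpleNamespace(name='spammer'), ['spammer']))
    print(should_skip(None, ['spammer']))
```

The comparison is case-sensitive, whereas the neighbouring `skip_subreddit` check lowercases the subreddit name first, so `--ignore-user` values would need to match the author's name exactly.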
bdfr/file_name_formatter.py

Lines changed: 17 additions & 9 deletions
@@ -110,31 +110,39 @@ def format_path(
         index = f'_{str(index)}' if index else ''
         if not resource.extension:
             raise BulkDownloaderException(f'Resource from {resource.url} has no extension')
-        ending = index + resource.extension
         file_name = str(self._format_name(resource.source_submission, self.file_format_string))
+        if not re.match(r'.*\.$', file_name) and not re.match(r'^\..*', resource.extension):
+            ending = index + '.' + resource.extension
+        else:
+            ending = index + resource.extension
 
         try:
-            file_path = self._limit_file_name_length(file_name, ending, subfolder)
+            file_path = self.limit_file_name_length(file_name, ending, subfolder)
         except TypeError:
             raise BulkDownloaderException(f'Could not determine path name: {subfolder}, {index}, {resource.extension}')
         return file_path
 
     @staticmethod
-    def _limit_file_name_length(filename: str, ending: str, root: Path) -> Path:
+    def limit_file_name_length(filename: str, ending: str, root: Path) -> Path:
         root = root.resolve().expanduser()
         possible_id = re.search(r'((?:_\w{6})?$)', filename)
         if possible_id:
             ending = possible_id.group(1) + ending
             filename = filename[:possible_id.start()]
         max_path = FileNameFormatter.find_max_path_length()
-        max_length_chars = 255 - len(ending)
-        max_length_bytes = 255 - len(ending.encode('utf-8'))
+        max_file_part_length_chars = 255 - len(ending)
+        max_file_part_length_bytes = 255 - len(ending.encode('utf-8'))
         max_path_length = max_path - len(ending) - len(str(root)) - 1
-        while len(filename) > max_length_chars or \
-                len(filename.encode('utf-8')) > max_length_bytes or \
-                len(filename) > max_path_length:
+
+        out = Path(root, filename + ending)
+        while any([len(filename) > max_file_part_length_chars,
+                   len(filename.encode('utf-8')) > max_file_part_length_bytes,
+                   len(str(out)) > max_path_length,
+                   ]):
             filename = filename[:-1]
-        return Path(root, filename + ending)
+            out = Path(root, filename + ending)
+
+        return out
 
     @staticmethod
     def find_max_path_length() -> int:

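Review note on the `format_path` and `limit_file_name_length` changes: the new branch inserts a dot only when neither the file name ends with one nor the extension starts with one, and the loop trims the name one character at a time until the limits are met. The sketch below isolates both ideas; it is not part of the commit. `build_ending` and `trim_to_byte_limit` are hypothetical helpers, and the trimming function is a simplification that checks only a UTF-8 byte limit, not the full path length.

```python
import re


def build_ending(file_name: str, index: str, extension: str) -> str:
    """Join index and extension, adding a dot only when neither side supplies one."""
    if not re.match(r'.*\.$', file_name) and not re.match(r'^\..*', extension):
        return index + '.' + extension
    return index + extension


def trim_to_byte_limit(stem: str, ending: str, limit: int = 255) -> str:
    """Drop trailing characters from stem until stem + ending fits in `limit` UTF-8 bytes."""
    # Simplified: the diff also compares character counts and the total path length
    while stem and len((stem + ending).encode('utf-8')) > limit:
        stem = stem[:-1]
    return stem + ending


if __name__ == '__main__':
    print(build_ending('title', '_1', 'png'))
    print(len(trim_to_byte_limit('a' * 300, '.png')))
```

In the diff, `max_path_length` already subtracts the lengths of `ending` and `root`, while `len(str(out))` includes both, so the new path check appears stricter than the name suggests. That may be intentional headroom.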
bdfr/site_downloaders/download_factory.py

Lines changed: 4 additions & 4 deletions
@@ -9,7 +9,7 @@
 from bdfr.site_downloaders.base_downloader import BaseDownloader
 from bdfr.site_downloaders.direct import Direct
 from bdfr.site_downloaders.erome import Erome
-from bdfr.site_downloaders.fallback_downloaders.youtubedl_fallback import YoutubeDlFallback
+from bdfr.site_downloaders.fallback_downloaders.ytdlp_fallback import YtdlpFallback
 from bdfr.site_downloaders.gallery import Gallery
 from bdfr.site_downloaders.gfycat import Gfycat
 from bdfr.site_downloaders.imgur import Imgur
@@ -24,7 +24,7 @@ class DownloadFactory:
     @staticmethod
     def pull_lever(url: str) -> Type[BaseDownloader]:
         sanitised_url = DownloadFactory.sanitise_url(url)
-        if re.match(r'(i\.)?imgur.*\.gifv$', sanitised_url):
+        if re.match(r'(i\.)?imgur.*\.gif.+$', sanitised_url):
             return Imgur
         elif re.match(r'.*/.*\.\w{3,4}(\?[\w;&=]*)?$', sanitised_url) and \
                 not DownloadFactory.is_web_resource(sanitised_url):
@@ -49,8 +49,8 @@ def pull_lever(url: str) -> Type[BaseDownloader]:
             return PornHub
         elif re.match(r'vidble\.com', sanitised_url):
             return Vidble
-        elif YoutubeDlFallback.can_handle_link(sanitised_url):
-            return YoutubeDlFallback
+        elif YtdlpFallback.can_handle_link(sanitised_url):
+            return YtdlpFallback
         else:
             raise NotADownloadableLinkError(f'No downloader module exists for url {url}')

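Review note on the `pull_lever` pattern change: replacing `\.gifv$` with `\.gif.+$` widens the match to any suffix that follows `.gif`, while a plain `.gif` link still falls through to the later branches. The sketch below compares the two patterns; it is not part of the commit. `matches_imgur_animation` is a hypothetical helper, the sample URLs are invented, and the URLs are assumed to be scheme-less, as `sanitise_url` appears to return them.

```python
import re

OLD_PATTERN = r'(i\.)?imgur.*\.gifv$'
NEW_PATTERN = r'(i\.)?imgur.*\.gif.+$'


def matches_imgur_animation(url: str, pattern: str = NEW_PATTERN) -> bool:
    """Return True when the scheme-less URL matches the given Imgur pattern."""
    # re.match anchors at the start of the string, as in pull_lever
    return re.match(pattern, url) is not None


if __name__ == '__main__':
    for url in ('i.imgur.com/abc123.gifv', 'i.imgur.com/abc123.gifvabc', 'i.imgur.com/abc123.gif'):
        print(url, matches_imgur_animation(url, OLD_PATTERN), matches_imgur_animation(url))
```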
bdfr/site_downloaders/fallback_downloaders/youtubedl_fallback.py renamed to bdfr/site_downloaders/fallback_downloaders/ytdlp_fallback.py

Lines changed: 7 additions & 5 deletions
@@ -6,6 +6,7 @@
 
 from praw.models import Submission
 
+from bdfr.exceptions import NotADownloadableLinkError
 from bdfr.resource import Resource
 from bdfr.site_authenticator import SiteAuthenticator
 from bdfr.site_downloaders.fallback_downloaders.fallback_downloader import BaseFallbackDownloader
@@ -14,9 +15,9 @@
 logger = logging.getLogger(__name__)
 
 
-class YoutubeDlFallback(BaseFallbackDownloader, Youtube):
+class YtdlpFallback(BaseFallbackDownloader, Youtube):
     def __init__(self, post: Submission):
-        super(YoutubeDlFallback, self).__init__(post)
+        super(YtdlpFallback, self).__init__(post)
 
     def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> list[Resource]:
         out = Resource(
@@ -29,8 +30,9 @@ def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> l
 
     @staticmethod
     def can_handle_link(url: str) -> bool:
-        attributes = YoutubeDlFallback.get_video_attributes(url)
+        try:
+            attributes = YtdlpFallback.get_video_attributes(url)
+        except NotADownloadableLinkError:
+            return False
         if attributes:
             return True
-        else:
-            return False

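Review note on `can_handle_link`: with the `else` branch removed, the method, as far as this hunk shows, falls off the end and returns `None` (not `False`) when `attributes` is falsy. Callers that only test truthiness are unaffected, but the `-> bool` annotation no longer holds strictly. The sketch below, not part of the commit, shows the same try/except shape with an explicit boolean result. The exception class and `get_video_attributes` are stand-ins written for illustration, not the BDFR implementations.

```python
class NotADownloadableLinkError(Exception):
    """Stand-in for bdfr.exceptions.NotADownloadableLinkError."""


def get_video_attributes(url: str) -> dict:
    """Hypothetical stand-in: pretend only URLs containing 'video' can be resolved."""
    if 'video' not in url:
        raise NotADownloadableLinkError(f'No video found at {url}')
    return {'ext': 'mp4'}


def can_handle_link(url: str) -> bool:
    """Return True only when attributes could be fetched and are non-empty."""
    try:
        attributes = get_video_attributes(url)
    except NotADownloadableLinkError:
        return False
    # bool() makes the falsy case return False instead of an implicit None
    return bool(attributes)


if __name__ == '__main__':
    print(can_handle_link('example.com/video/1'))
    print(can_handle_link('example.com/page'))
```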
bdfr/site_downloaders/imgur.py

Lines changed: 2 additions & 2 deletions
@@ -42,9 +42,9 @@ def _compute_image_url(self, image: dict) -> Resource:
     @staticmethod
     def _get_data(link: str) -> dict:
         link = link.rstrip('?')
-        if re.match(r'(?i).*\.gifv$', link):
+        if re.match(r'(?i).*\.gif.+$', link):
             link = link.replace('i.imgur', 'imgur')
-            link = re.sub('(?i)\\.gifv$', '', link)
+            link = re.sub('(?i)\\.gif.+$', '', link)
 
         res = Imgur.retrieve_url(link, cookies={'over18': '1', 'postpagebeta': '0'})

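Review note on `_get_data`: the link is normalised before the page is requested, moving from the `i.imgur` host to `imgur` and dropping everything from `.gif` plus at least one further character onward. The sketch below reproduces only that normalisation step and is not part of the commit; `normalise_imgur_link` is a hypothetical name and the sample links are invented.

```python
import re


def normalise_imgur_link(link: str) -> str:
    """Apply the link clean-up that precedes the page request in the diff."""
    link = link.rstrip('?')
    if re.match(r'(?i).*\.gif.+$', link):
        link = link.replace('i.imgur', 'imgur')
        # Remove '.gif' and whatever follows it, case-insensitively
        link = re.sub(r'(?i)\.gif.+$', '', link)
    return link


if __name__ == '__main__':
    print(normalise_imgur_link('https://i.imgur.com/abc123.gifv'))
    print(normalise_imgur_link('https://i.imgur.com/abc123.gif'))
```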
bdfr/site_downloaders/pornhub.py

Lines changed: 7 additions & 1 deletion
@@ -6,6 +6,7 @@
 
 from praw.models import Submission
 
+from bdfr.exceptions import SiteDownloaderError
 from bdfr.resource import Resource
 from bdfr.site_authenticator import SiteAuthenticator
 from bdfr.site_downloaders.youtube import Youtube
@@ -22,10 +23,15 @@ def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> l
             'format': 'best',
             'nooverwrites': True,
         }
+        if video_attributes := super().get_video_attributes(self.post.url):
+            extension = video_attributes['ext']
+        else:
+            raise SiteDownloaderError()
+
         out = Resource(
             self.post,
             self.post.url,
             super()._download_video(ytdl_options),
-            super().get_video_attributes(self.post.url)['ext'],
+            extension,
         )
         return [out]

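Review note on the PornHub change: the assignment expression fetches the attributes once and raises a dedicated error when nothing comes back, where the old code would have subscripted a possibly empty result. The sketch below shows the same guard in isolation and is not part of the commit. `resolve_extension` and its `fetch` parameter are hypothetical, the exception class is a stand-in, and the error message is an addition (the diff raises `SiteDownloaderError()` without one).

```python
from typing import Callable, Optional


class SiteDownloaderError(Exception):
    """Stand-in for bdfr.exceptions.SiteDownloaderError."""


def resolve_extension(fetch: Callable[[str], Optional[dict]], url: str) -> str:
    """Return the 'ext' entry from fetched attributes, or raise if none were found."""
    # The walrus operator binds the result and tests its truthiness in one step
    if video_attributes := fetch(url):
        return video_attributes['ext']
    raise SiteDownloaderError(f'Could not determine an extension for {url}')


if __name__ == '__main__':
    print(resolve_extension(lambda url: {'ext': 'mp4'}, 'example.com/view'))
```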