Commit e7629d7

Merge pull request #640 from aliparlakci/development
2 parents e4fcacf + 0ce2585 commit e7629d7

46 files changed

Lines changed: 672 additions & 141 deletions

.gitattributes

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+# Declare files that will always have CRLF line endings on checkout.
+*.ps1 text eol=crlf
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+name: Protect master branch
+
+on:
+  pull_request:
+    branches:
+      - master
+jobs:
+  merge_check:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Check if the pull request is mergeable to master
+        run: |
+          if [[ "$GITHUB_HEAD_REF" == 'development' && "$GITHUB_REPOSITORY" == 'aliparlakci/bulk-downloader-for-reddit' ]]; then exit 0; else exit 1; fi;

README.md

Lines changed: 50 additions & 3 deletions
@@ -53,6 +53,12 @@ However, these commands are not enough. You should chain parameters in [Options]
 python3 -m bdfr download ./path/to/output --subreddit Python -L 10
 ```
 ```bash
+python3 -m bdfr download ./path/to/output --user reddituser --submitted -L 100
+```
+```bash
+python3 -m bdfr download ./path/to/output --user reddituser --submitted --all-comments --comment-context
+```
+```bash
 python3 -m bdfr download ./path/to/output --user me --saved --authenticate -L 25 --file-scheme '{POSTID}'
 ```
 ```bash
@@ -62,6 +68,31 @@ python3 -m bdfr download ./path/to/output --subreddit 'Python, all, mindustry' -
 python3 -m bdfr archive ./path/to/output --subreddit all --format yaml -L 500 --folder-scheme ''
 ```
 
+Alternatively, you can pass options through a YAML file.
+
+```bash
+python3 -m bdfr download ./path/to/output --opts my_opts.yaml
+```
+
+For example, running it with the following file
+
+```yaml
+skip: [mp4, avi]
+file_scheme: "{UPVOTES}_{REDDITOR}_{POSTID}_{DATE}"
+limit: 10
+sort: top
+subreddit:
+  - EarthPorn
+  - CityPorn
+```
+
+would be equivalent to (take note that in YAML there is `file_scheme` instead of `file-scheme`):
+```bash
+python3 -m bdfr download ./path/to/output --skip mp4 --skip avi --file-scheme "{UPVOTES}_{REDDITOR}_{POSTID}_{DATE}" -L 10 -S top --subreddit EarthPorn --subreddit CityPorn
+```
+
+When the same option is specified both in the YAML file and as a command-line argument, the command-line argument takes priority.
+
 ## Options
 
 The following options are common between both the `archive` and `download` commands of the BDFR.
@@ -74,6 +105,10 @@ The following options are common between both the `archive` and `download` comma
 - `--config`
   - If the path to a configuration file is supplied with this option, the BDFR will use the specified config
   - See [Configuration Files](#configuration) for more details
+- `--opts`
+  - Load options from a YAML file.
+  - Has higher priority than the global config file but lower than command-line arguments.
+  - See [opts_example.yaml](./opts_example.yaml) for an example file.
 - `--disable-module`
   - Can be specified multiple times
   - Disables certain modules from being used
@@ -92,8 +127,8 @@ The following options are common between both the `archive` and `download` comma
   - This option will make the BDFR use the supplied user's saved posts list as a download source
   - This requires an authenticated Reddit instance, using the `--authenticate` flag, as well as `--user` set to `me`
 - `--search`
-  - This will apply the specified search term to specific lists when scraping submissions
-  - A search term can only be applied to subreddits and multireddits, supplied with the `- s` and `-m` flags respectively
+  - This will apply the input search term to specific lists when scraping submissions
+  - A search term can only be applied when using the `--subreddit` and `--multireddit` flags
 - `--submitted`
   - This will use a user's submissions as a source
   - A user must be specified with `--user`
@@ -192,6 +227,15 @@ The following options apply only to the `download` command. This command downloa
   - This skips all submissions from the specified subreddit
   - Can be specified multiple times
   - Also accepts CSV subreddit names
+- `--min-score`
+  - This skips all submissions which have fewer than specified upvotes
+- `--max-score`
+  - This skips all submissions which have more than specified upvotes
+- `--min-score-ratio`
+  - This skips all submissions which have lower than specified upvote ratio
+- `--max-score-ratio`
+  - This skips all submissions which have higher than specified upvote ratio
+
 
 ### Archiver Options
 
@@ -215,7 +259,10 @@ The `clone` command can take all the options listed above for both the `archive`
 
 ## Common Command Tricks
 
-A common use case is for subreddits/users to be loaded from a file. The BDFR doesn't support this directly but it is simple enough to do through the command-line. Consider a list of usernames to download; they can be passed through to the BDFR with the following command, assuming that the usernames are in a text file:
+A common use case is for subreddits/users to be loaded from a file. The BDFR supports this via YAML file options (`--opts my_opts.yaml`).
+
+Alternatively, you can use the command-line [xargs](https://en.wikipedia.org/wiki/Xargs) function.
+For a list of users `users.txt` (one user per line), type:
 
 ```bash
 cat users.txt | xargs -L 1 echo --user | xargs -L 50 python3 -m bdfr download <ARGS>

bdfr/__main__.py

Lines changed: 10 additions & 4 deletions
@@ -16,13 +16,19 @@
     click.argument('directory', type=str),
     click.option('--authenticate', is_flag=True, default=None),
     click.option('--config', type=str, default=None),
+    click.option('--opts', type=str, default=None),
     click.option('--disable-module', multiple=True, default=None, type=str),
+    click.option('--exclude-id', default=None, multiple=True),
+    click.option('--exclude-id-file', default=None, multiple=True),
+    click.option('--file-scheme', default=None, type=str),
+    click.option('--folder-scheme', default=None, type=str),
     click.option('--ignore-user', type=str, multiple=True, default=None),
     click.option('--include-id-file', multiple=True, default=None),
     click.option('--log', type=str, default=None),
     click.option('--saved', is_flag=True, default=None),
     click.option('--search', default=None, type=str),
     click.option('--submitted', is_flag=True, default=None),
+    click.option('--subscribed', is_flag=True, default=None),
     click.option('--time-format', type=str, default=None),
     click.option('--upvoted', is_flag=True, default=None),
     click.option('-L', '--limit', default=None, type=int),
@@ -37,17 +43,17 @@
 ]
 
 _downloader_options = [
-    click.option('--file-scheme', default=None, type=str),
-    click.option('--folder-scheme', default=None, type=str),
     click.option('--make-hard-links', is_flag=True, default=None),
     click.option('--max-wait-time', type=int, default=None),
     click.option('--no-dupes', is_flag=True, default=None),
     click.option('--search-existing', is_flag=True, default=None),
-    click.option('--exclude-id', default=None, multiple=True),
-    click.option('--exclude-id-file', default=None, multiple=True),
     click.option('--skip', default=None, multiple=True),
     click.option('--skip-domain', default=None, multiple=True),
     click.option('--skip-subreddit', default=None, multiple=True),
+    click.option('--min-score', type=int, default=None),
+    click.option('--max-score', type=int, default=None),
+    click.option('--min-score-ratio', type=float, default=None),
+    click.option('--max-score-ratio', type=float, default=None),
 ]
 
 _archiver_options = [

bdfr/archiver.py

Lines changed: 3 additions & 0 deletions
@@ -34,6 +34,9 @@ def download(self):
                         f'Submission {submission.id} in {submission.subreddit.display_name} skipped'
                         f' due to {submission.author.name if submission.author else "DELETED"} being an ignored user')
                     continue
+                if submission.id in self.excluded_submission_ids:
+                    logger.debug(f'Object {submission.id} in exclusion list, skipping')
+                    continue
                 logger.debug(f'Attempting to archive submission {submission.id}')
                 self.write_entry(submission)

bdfr/configuration.py

Lines changed: 37 additions & 2 deletions
@@ -2,16 +2,21 @@
 # coding=utf-8
 
 from argparse import Namespace
+from pathlib import Path
 from typing import Optional
+import logging
 
 import click
+import yaml
 
+logger = logging.getLogger(__name__)
 
 class Configuration(Namespace):
     def __init__(self):
         super(Configuration, self).__init__()
         self.authenticate = False
         self.config = None
+        self.opts: Optional[str] = None
         self.directory: str = '.'
         self.disable_module: list[str] = []
         self.exclude_id = []
@@ -33,8 +38,13 @@ def __init__(self):
         self.skip: list[str] = []
         self.skip_domain: list[str] = []
         self.skip_subreddit: list[str] = []
+        self.min_score = None
+        self.max_score = None
+        self.min_score_ratio = None
+        self.max_score_ratio = None
         self.sort: str = 'hot'
         self.submitted: bool = False
+        self.subscribed: bool = False
         self.subreddit: list[str] = []
         self.time: str = 'all'
         self.time_format = None
@@ -48,6 +58,31 @@ def __init__(self):
         self.comment_context: bool = False
 
     def process_click_arguments(self, context: click.Context):
+        if context.params.get('opts') is not None:
+            self.parse_yaml_options(context.params['opts'])
         for arg_key in context.params.keys():
-            if arg_key in vars(self) and context.params[arg_key] is not None:
-                vars(self)[arg_key] = context.params[arg_key]
+            if not hasattr(self, arg_key):
+                logger.warning(f'Ignoring an unknown CLI argument: {arg_key}')
+                continue
+            val = context.params[arg_key]
+            if val is None or val == ():
+                # don't overwrite with an empty value
+                continue
+            setattr(self, arg_key, val)
+
+    def parse_yaml_options(self, file_path: str):
+        yaml_file_loc = Path(file_path)
+        if not yaml_file_loc.exists():
+            logger.error(f'No YAML file found at {yaml_file_loc}')
+            return
+        with open(yaml_file_loc) as file:
+            try:
+                opts = yaml.load(file, Loader=yaml.FullLoader)
+            except yaml.YAMLError as e:
+                logger.error(f'Could not parse YAML options file: {e}')
+                return
+        for arg_key, val in opts.items():
+            if not hasattr(self, arg_key):
+                logger.warning(f'Ignoring an unknown YAML argument: {arg_key}')
+                continue
+            setattr(self, arg_key, val)
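
The precedence that `process_click_arguments` sets up (built-in defaults, then the `--opts` YAML file, then command-line arguments) can be sketched without `click` or `yaml`. This is a minimal illustration, not the project's code: `SketchConfig`, `apply_yaml_options`, and `apply_cli_params` are hypothetical names, and plain dicts stand in for the parsed YAML file and for `context.params`.

```python
from argparse import Namespace


class SketchConfig(Namespace):
    """Hypothetical stand-in for Configuration with a few of its defaults."""

    def __init__(self):
        super().__init__()
        self.limit = None
        self.sort = 'hot'
        self.skip = []

    def apply_yaml_options(self, opts: dict) -> None:
        # Like parse_yaml_options: unknown keys are ignored (the original
        # also logs a warning), known keys overwrite the defaults.
        for key, val in opts.items():
            if hasattr(self, key):
                setattr(self, key, val)

    def apply_cli_params(self, params: dict) -> None:
        # Like the loop in process_click_arguments: None and () are treated
        # as "not given", so they never overwrite a YAML or default value.
        for key, val in params.items():
            if not hasattr(self, key) or val is None or val == ():
                continue
            setattr(self, key, val)


config = SketchConfig()
# YAML is applied first ...
config.apply_yaml_options({'limit': 10, 'sort': 'top', 'skip': ['mp4', 'avi'], 'bogus': 1})
# ... then the command line, which wins only where a value was actually passed.
config.apply_cli_params({'limit': 25, 'sort': None, 'skip': ()})
```

Here `limit` ends up with the command-line value, while `sort` and `skip` keep their YAML values because the command line left them unset.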

bdfr/connector.py

Lines changed: 17 additions & 7 deletions
@@ -243,9 +243,19 @@ def split_args_input(entries: list[str]) -> set[str]:
         return set(all_entries)
 
     def get_subreddits(self) -> list[praw.models.ListingGenerator]:
-        if self.args.subreddit:
-            out = []
-            for reddit in self.split_args_input(self.args.subreddit):
+        out = []
+        subscribed_subreddits = set()
+        if self.args.subscribed:
+            if self.args.authenticate:
+                try:
+                    subscribed_subreddits = list(self.reddit_instance.user.subreddits(limit=None))
+                    subscribed_subreddits = set([s.display_name for s in subscribed_subreddits])
+                except prawcore.InsufficientScope:
+                    logger.error('BDFR has insufficient scope to access subreddit lists')
+            else:
+                logger.error('Cannot find subscribed subreddits without an authenticated instance')
+        if self.args.subreddit or subscribed_subreddits:
+            for reddit in self.split_args_input(self.args.subreddit) | subscribed_subreddits:
                 if reddit == 'friends' and self.authenticated is False:
                     logger.error('Cannot read friends subreddit without an authenticated instance')
                     continue
@@ -270,9 +280,7 @@ def get_subreddits(self) -> list[praw.models.ListingGenerator]:
                         logger.debug(f'Added submissions from subreddit {reddit}')
                     except (errors.BulkDownloaderException, praw.exceptions.PRAWException) as e:
                         logger.error(f'Failed to get submissions for subreddit {reddit}: {e}')
-            return out
-        else:
-            return []
+        return out
 
     def resolve_user_name(self, in_name: str) -> str:
         if in_name == 'me':
@@ -406,7 +414,9 @@ def check_subreddit_status(subreddit: praw.models.Subreddit):
         try:
             assert subreddit.id
         except prawcore.NotFound:
-            raise errors.BulkDownloaderException(f'Source {subreddit.display_name} does not exist or cannot be found')
+            raise errors.BulkDownloaderException(f"Source {subreddit.display_name} cannot be found")
+        except prawcore.Redirect:
+            raise errors.BulkDownloaderException(f"Source {subreddit.display_name} does not exist")
         except prawcore.Forbidden:
             raise errors.BulkDownloaderException(f'Source {subreddit.display_name} is private and cannot be scraped')
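
The new `get_subreddits` takes the set union of the explicitly requested subreddits and the subscribed ones. The sketch below illustrates only that merging step. The body of `split_args_input` is not shown in this diff, so `split_entries` is a hypothetical stand-in that assumes comma-separated entries are split and stripped; `merge_sources` is likewise an illustrative name.

```python
def split_entries(entries: list[str]) -> set[str]:
    """Hypothetical stand-in for split_args_input: split comma-separated names."""
    names = set()
    for entry in entries:
        for name in entry.split(','):
            name = name.strip()
            if name:
                names.add(name)
    return names


def merge_sources(requested: list[str], subscribed: set[str]) -> set[str]:
    # Set union drops duplicates, so a subreddit that is both requested
    # and subscribed appears only once in the result.
    return split_entries(requested) | subscribed


merged = merge_sources(['Python, all', 'mindustry'], {'Python', 'EarthPorn'})
```

Because the result is a set, the order in which subreddits are visited is not defined.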
412422

bdfr/default_config.cfg

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 [DEFAULT]
 client_id = U-6gk4ZCh3IeNQ
 client_secret = 7CZHY6AmKweZME5s50SfDGylaPg
-scopes = identity, history, read, save
+scopes = identity, history, read, save, mysubreddits
 backup_log_count = 3
 max_wait_time = 120
 time_format = ISO

bdfr/downloader.py

Lines changed: 13 additions & 0 deletions
@@ -57,6 +57,19 @@ def _download_submission(self, submission: praw.models.Submission):
                 f'Submission {submission.id} in {submission.subreddit.display_name} skipped'
                 f' due to {submission.author.name if submission.author else "DELETED"} being an ignored user')
             return
+        elif self.args.min_score and submission.score < self.args.min_score:
+            logger.debug(
+                f"Submission {submission.id} filtered due to score {submission.score} < [{self.args.min_score}]")
+            return
+        elif self.args.max_score and self.args.max_score < submission.score:
+            logger.debug(
+                f"Submission {submission.id} filtered due to score {submission.score} > [{self.args.max_score}]")
+            return
+        elif (self.args.min_score_ratio and submission.upvote_ratio < self.args.min_score_ratio) or (
+            self.args.max_score_ratio and self.args.max_score_ratio < submission.upvote_ratio
+        ):
+            logger.debug(f"Submission {submission.id} filtered due to score ratio ({submission.upvote_ratio})")
+            return
         elif not isinstance(submission, praw.models.Submission):
             logger.warning(f'{submission.id} is not a submission')
             return
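
The added checks test the thresholds by truthiness (`self.args.min_score and ...`), so a threshold of `0` or `0.0` behaves as if it were unset; the commit does not say whether that is intended. The sketch below shows one possible variant that uses explicit `is not None` checks instead. `passes_score_filters` is a hypothetical helper, and plain numbers stand in for `submission.score` and `submission.upvote_ratio`.

```python
from typing import Optional


def passes_score_filters(
    score: int,
    upvote_ratio: float,
    min_score: Optional[int] = None,
    max_score: Optional[int] = None,
    min_ratio: Optional[float] = None,
    max_ratio: Optional[float] = None,
) -> bool:
    """Return True if a submission with these values should be kept."""
    # `is not None` keeps a threshold of 0 active, unlike a truthiness test.
    if min_score is not None and score < min_score:
        return False
    if max_score is not None and score > max_score:
        return False
    if min_ratio is not None and upvote_ratio < min_ratio:
        return False
    if max_ratio is not None and upvote_ratio > max_ratio:
        return False
    return True
```

With this variant, `min_score=0` filters out submissions with a negative score, which the truthiness-based checks would let through.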

bdfr/file_name_formatter.py

Lines changed: 3 additions & 0 deletions
@@ -111,6 +111,9 @@ def format_path(
         if not resource.extension:
             raise BulkDownloaderException(f'Resource from {resource.url} has no extension')
         file_name = str(self._format_name(resource.source_submission, self.file_format_string))
+
+        file_name = re.sub(r'\n', ' ', file_name)
+
         if not re.match(r'.*\.$', file_name) and not re.match(r'^\..*', resource.extension):
             ending = index + '.' + resource.extension
         else:
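
The effect of the added `re.sub` line can be shown in isolation. In this small sketch, `flatten_newlines` is a hypothetical wrapper around the same substitution, and the sample strings are made up.

```python
import re


def flatten_newlines(file_name: str) -> str:
    # Same substitution as the added line: each newline becomes one space.
    return re.sub(r'\n', ' ', file_name)


flattened = flatten_newlines('first line\nsecond line')
```

The pattern matches only `\n`, so a carriage return from a `\r\n` pair is left in the name.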
