Commit graph

149 commits

Author SHA1 Message Date
Mark Felder
c17a78c55a Rich Media: add stream byte counting as an extra protection against malicious URLs 2024-08-29 09:37:11 -04:00
Mark Felder
0804b73c0a This error is not returned by Tesla
Upstream has a bug filed for this as they aren't handling this error internally, so it was raising
2024-08-27 22:08:31 -04:00
Mark Felder
44901502ff Fix incorrect identifier for the with statement 2024-08-27 21:56:02 -04:00
Mark Felder
116fe77b77 Tesla.Middleware.Timeout breaks streaming bodies
These are executed by Oban now and Oban can enforce the timeout if the regular HTTP timeout is not sufficient.
2024-08-27 21:55:07 -04:00
Mark Felder
0a86d2b3ac Handle streaming response errors 2024-08-27 21:22:59 -04:00
Mark Felder
ec8db9d4ee RichMedia: skip the HTTP HEAD request for adapters that support streaming the response body 2024-08-27 21:09:15 -04:00
Mark Felder
97d488aea3 Fix RichMedia negative cache entries
The negative cache entry was a nil value, but that is an expected response when the cache is missing an entry so it didn't work as intended.
2024-07-24 15:45:39 -04:00
Mark Felder
5a62868106 Consider errors during HTTP GET and HEAD to be unrecoverable and insert a negative cache entry
This is for a normal HTTP error response or timeout while receiving the data. A hard error from a process crash, DNS lookup failure, etc should produce a different response than {:ok, %Tesla.Env{}} and the request/job will be retryable.
2024-07-24 15:43:49 -04:00
Mark Felder
731f7b87d2 Pad RichMediaWorker timeout to be 2s longer than the Rich Media HTTP timeout 2024-07-24 15:42:50 -04:00
Mark Felder
d124d8645e Rework some Rich Media functionality for better error handling
Oban should not retry jobs that are likely to fail again
2024-07-17 12:40:07 -04:00
Mark Felder
b135fa35a1 RichMedia: test that activity is streamed out 2024-06-24 09:47:16 -04:00
Mark Felder
634e3d4155 Add test validating the activity_id is correctly present in the Oban job
This was preventing the activity from being streamed over websockets.
2024-06-23 21:45:56 -04:00
Mark Felder
17d04ccc8b RichMedia backfill processing through Oban 2024-06-19 23:20:22 -04:00
Mark Felder
4dfa50f256 Rename RichMediaExpirationWorker to RichMediaWorker 2024-06-19 22:24:48 -04:00
feld
f44987bd0f Merge branch 'bugfix/rich_media_config' into 'develop'
RichMedia: Respect configuration on status previews

See merge request pleroma/pleroma!4130
2024-06-07 20:37:19 +00:00
Haelwenn (lanodan) Monnier
65c8763907
RichMedia: Add extra checks on configuration 2024-05-29 08:02:06 +02:00
Haelwenn (lanodan) Monnier
c16ef40f13
RichMedia: Respect configuration on status previews 2024-05-29 08:02:04 +02:00
Mark Felder
a50c657427 Add a dedicated connection pool for Rich Media
Sharing this pool with regular Media is problematic as Rich Media will connect to many different
domains and thrash the pool, but regular Media will have predictable connections to the webservers
hosting media for the fediverse servers you peer with.
2024-05-27 11:17:02 -04:00
Mark Felder
807782b7f9 Fix rich media parsing some Amazon URLs 2024-05-26 14:02:20 -04:00
Mark Felder
54c2bab25f Fix module struct matching 2024-05-07 22:27:18 -04:00
Mark Felder
9a83301ff8 Credo 2024-05-07 22:11:19 -04:00
Mark Felder
37c35daba6 Credo 2024-05-07 22:10:49 -04:00
Mark Felder
9b9a32bf74 Fix compile warning
warning: "else" clauses will never match because all patterns in "with" will always match
  lib/pleroma/web/rich_media/parser/ttl/opengraph.ex:10
2024-05-07 21:56:27 -04:00
Mark Felder
5a5a193877 Fix broken Rich Media parsing when the image URL is a relative path 2024-05-07 19:54:56 -04:00
Mark Felder
d21aa1a77c Respect the TTL returned in OpenGraph tags 2024-05-07 19:54:56 -04:00
Mark Felder
df0734fcbf Increase the :max_body for Rich Media to 5MB
Websites are increasingly getting more bloated with tricks like inlining content (e.g., CNN.com) which puts pages at or above 5MB. This value may still be too low.
2024-05-07 19:54:56 -04:00
Mark Felder
ede414094f RichMedia refactor
Rich Media parsing was previously handled on-demand with a 2 second HTTP request timeout and retained only in Cachex. Every time a Pleroma instance is restarted it will have to request and parse the data for each status with a URL detected. When fetching a batch of statuses they were processed in parallel to attempt to keep the maximum latency at 2 seconds, but often resulted in a timeline appearing to hang during loading due to a URL that could not be successfully reached. URLs which had images links that expire (Amazon AWS) were parsed and inserted with a TTL to ensure the image link would not break.

Rich Media data is now cached in the database and fetched asynchronously. Cachex is used as a read-through cache. When the data becomes available we stream an update to the clients. If the result is returned quickly the experience is almost seamless. Activities were already processed for their Rich Media data during ingestion to warm the cache, so users should not normally encounter the asynchronous loading of the Rich Media data.

Implementation notes:

- The async worker is a Task with a globally unique process name to prevent duplicate processing of the same URL
- The Task will attempt to fetch the data 3 times with increasing sleep time between attempts
- The HTTP request obeys the default HTTP request timeout value instead of 2 seconds
- URLs that cannot be successfully parsed due to an unexpected error receives a negative cache entry for 15 minutes
- URLs that fail with an expected error will receive a negative cache with no TTL
- Activities that have no detected URLs insert a nil value in the Cachex :scrubber_cache so we do not repeat parsing the object content with Floki every time the activity is rendered
- Expiring image URLs are handled with an Oban job
- There is no automatic cleanup of the Rich Media data in the database, but it is safe to delete at any time
- The post draft/preview feature makes the URL processing synchronous so the rendered post preview will have an accurate rendering

Overall performance of timelines and creating new posts which contain URLs is greatly improved.
2024-05-07 19:54:56 -04:00
Mark Felder
9f2319e50d RichMedia.Helpers: move the validate_page_url/1 function to the Parser module
This will ensure that the page validation happens in Parser.parse/1 so it can be called from anywhere and still filter invalid URLs.
2024-02-06 18:34:02 -05:00
Mark Felder
6b7b443ff9 Pleroma.Web.RichMedia.Parser: Remove test-specific codepaths
Also consolidate Tesla mocks into the HttpRequestMock module.

Tests were not exercising the real codepaths. The Rich Media Preview only works with https, but most of these tests were only mocking http.
2024-02-06 18:33:54 -05:00
Mark Felder
0cc038b67c Ensure URLs with IP addresses for the host do not generate previews 2024-02-05 00:09:37 -05:00
Mark Felder
579561e97b URI.authority is deprecated 2024-02-04 23:49:07 -05:00
Mark Felder
04fc4eddaa Fix Rich Media Previews for updated activities
The Rich Media Previews were not regenerated when a post was updated due to a cache invalidation issue. They are now cached by the activity id so they can be evicted with the other activity cache objects in the :scrubber_cache.
2024-02-04 23:47:04 -05:00
Haelwenn
251c455b91 Merge branch 'deps-bump' into 'develop'
Bump dependencies

See merge request pleroma/pleroma!4044
2024-01-29 17:43:00 +00:00
Mark Felder
06b8923d42 RichMedia.Parser.TTL.AwsSignedUrl: dialyzer fix
lib/pleroma/web/rich_media/parser/ttl/aws_signed_url.ex:9:callback_type_mismatch
Type mismatch for @callback ttl/2 in Pleroma.Web.RichMedia.Parser.TTL behaviour.

Expected type:
nil | integer()

Actual type:
{:error, <<_::64, _::size(8)>>} | {:ok, integer()}
2024-01-26 17:37:32 -05:00
Mark Felder
5b95abaeea Credo.Check.Readability.PredicateFunctionNames
This check was recently improved in Credo and it does make sense for readability.

The offending functions in Pleroma have been renamed and a couple missing the ? suffix have been fixed as well.
2024-01-26 16:59:58 -05:00
Mark Felder
09ae0ab24a Fix invalid type
lib/pleroma/web/rich_media/parser.ex:105:unknown_type
Unknown type: Integer.t/0.
2024-01-20 17:16:10 -05:00
Mark Felder
467a65af90 Fix invalid types
lib/pleroma/web/rich_media/parser/ttl.ex:6:unknown_type
Unknown type: Integer.t/0.

lib/pleroma/web/rich_media/parser/ttl.ex:6:unknown_type
Unknown type: Map.t/0.
2024-01-20 17:14:56 -05:00
Mark Felder
9896b64f54 Elixir 1.15: Chase the Logger.warn deprecation 2023-12-20 20:16:26 +00:00
Lain Soykaf
00def0875b RichMediaTest: Use mocked config 2023-12-12 13:28:11 +04:00
Mark Felder
0d68804aa7 Filter OEmbed HTML tags 2023-05-26 19:54:24 +02:00
lain
e853cfe7c3 Revert "Merge branch 'copyright-bump' into 'develop'"
This reverts merge request !3825
2023-01-02 20:38:50 +00:00
marcin mikołajczak
10886eeaa2 Bump copyright year
Signed-off-by: marcin mikołajczak <git@mkljczk.pl>
2023-01-01 12:13:06 +01:00
Sean King
17aa3644be
Copyright bump for 2022 2022-02-25 23:11:42 -07:00
Haelwenn (lanodan) Monnier
c4439c630f
Bump Copyright to 2021
grep -rl '# Copyright © .* Pleroma' * | xargs sed -i 's;Copyright © .* Pleroma .*;Copyright © 2017-2021 Pleroma Authors <https://pleroma.social/>;'
2021-01-13 07:49:50 +01:00
lain
e1e7e4d379 Object: Rework how Object.normalize works
Now it defaults to not fetching, and the option is named.
2021-01-04 13:38:31 +01:00
lain
713612c377 Cachex: Make caching provider switchable at runtime.
Defaults to Cachex.
2020-12-18 17:44:46 +01:00
Alexander Strizhakov
8d218ebaf5
Moving some background jobs into simple tasks
- fetching activity data
- attachment prefetching
- using limiter to prevent overload
2020-11-11 13:39:49 +03:00
Alexander Strizhakov
fc7151a9c4
more files renamings 2020-10-13 16:38:19 +03:00
Alexander Strizhakov
103f3dcb9e
rich media parser ttl files consistency 2020-10-13 16:38:15 +03:00
Mark Felder
8539e386c3 Add missing Copyright headers 2020-10-12 12:00:50 -05:00