• 1 Post
  • 660 Comments
Joined 2 years ago
cake
Cake day: July 14th, 2023

help-circle
  • I think the best way to handle this would be to just encode everything and upload all files. If I wanted some amount of history, I’d use some file system with automatic snapshots, like ZFS.

    If I wanted to do what you’ve outlined, I would probably use rclone with filtering for the extension types or something along those lines.

    If I wanted to do this with Git specifically, though, this is what I would try first:

    First, add lossless extensions (*.flac, *.wav) to my repo’s .gitignore

    Second, schedule a job on my local machine that:

    1. Watches for changes to the local file system (e.g., with inotifywait or fswatch)
    2. For any new lossless files, if there isn’t already an accompanying lossy files (i.e., identified by being collocated, having the exact same filename, sans extension, with an accepted extension, e.g., .mp3, .ogg - possibly also with a confirmation that the codec is up to my standards with a call to ffprobe, avprobe, mediainfo, exiftool, or something similar), it encodes the file to your preferred lossy format.
    3. Use git status --porcelain to if there have been any changes.
    4. If so, run git add --all && git commit --message "Automatic commit" && git push
    5. Optionally, automatically craft a better commit message by checking which files have been changed, generating text like Added album: "Satin Panthers - EP" by Hudson Mohawke or Removed album: "Brat" by Charli XCX; Added album "Brat and it's the same but there's three more songs so it's not" by Charli XCX

    Third, schedule a job on my remote machine server that runs git pull at regular intervals.

    One issue with this approach is that if you delete a file (as opposed to moving it), the space is not recovered on your local or your server. If space on your server is a concern, you could work around that by running something like the answer here (adjusting the depth to an appropriate amount for your use case):

    git fetch --depth=1
    git reflog expire --expire-unreachable=now --all
    git gc --aggressive --prune=all
    

    Another potential issue is that what I described above involves having an intermediary git to push to and pull from, e.g., running on a hosted Git forge, like GitHub, Codeberg, etc… This could result in getting copyright complaints or something along those lines, though.

    Alternatively, you could use your server as the git server (or check out forgejo if you want a Git forge as well), but then you can’t use the above trick to prune file history and save space from deleted files (on the server, at least - you could on your local, I think). If you then check out your working copy in a way such that Git can use hard links, you should at least be able to avoid needing to store two copies on your server.

    The other thing to check out, if you take this approach, is git lfs. EDIT: Actually, I take that back - you probably don’t want to use Git LFS.



  • Copied from the post:

    You may have seen reports of leaks of older text messages that had previously been sent to Steam customers. We have examined the leak sample and have determined this was NOT a breach of Steam systems.

    We’re still digging into the source of the leak, which is compounded by the fact that any SMS messages are unencrypted in transit, and routed through multiple providers on the way to your phone.​

    The leak consisted of older text messages that included one-time codes that were only valid for 15-minute time frames and the phone numbers they were sent to. The leaked data did not associate the phone numbers with a Steam account, password information, payment information or other personal data. Old text messages cannot be used to breach the security of your Steam account, and whenever a code is used to change your Steam email or password using SMS, you will receive a confirmation via email and/or Steam secure messages.​

    You do not need to change your passwords or phone numbers as a result of this event. It is a good reminder to treat any account security messages that you have not explicitly requested as suspicious. We recommend regularly checking your Steam account security at any time at ​

    https://store.steampowered.com/account/authorizeddevices

    We also recommend setting up the Steam Mobile Authenticator if you haven’t already, as it gives us the best way to send secure messages about your account and your account’s safety.


  • Assuming you’re using ollama (is there another reason to use ollama.com?), you can use compatible files from huggingface directly in ollama. The model page will give you the instructions for the command to run; I always change ollama run to ollama pull , though. Instructions: https://huggingface.co/docs/hub/ollama

    You should be able to fit Qwen3 32B at Q4_K_M with an acceptable context, and it did very well on math benchmarks (with thinking enabled). You can disable thinking by including /no_think at the end of your prompt to speed up responses, but I’m not sure how well it handles math under those circumstances. I wouldn’t even consider disabling thinking unless you were grading one question per prompt.

    The ollama Qwen3 page is https://ollama.com/library/qwen3:32b and the default 32B quant is Q4_K_M. I personally am using the Q6_K quant by unsloth, and their quants have been great (when supported by ollama), often being the first to fix bugs impacting other quantizations.

    I’m not sure if Q4_K_M is the optimal quant style for Intel Arc, but the others that might be better are not supported by ollama, anyway, as far as I know.

    Qwen3’s real world knowledge is bad, so if there are questions that rely on that you may need to include the relevant facts as part of the prompt or use an ollama frontend that supports web searches.

    Other options: This does seem like something Gemma3 27B would be good at, so it’s too bad you can’t use it. Older Gemmas may be good, but I’m not sure. Llama3.3 70B is also out, unless you have a decent amount of system RAM and are okay with offloading less than half to GPU. I could see it outperforming my recommendation below but I would be very surprised for the 8B version to outperform it. Older Qwen2.5 is decent at math but unless you grab QwQ doesn’t include thinking.


  • You don’t have to finish the file to share it though, that’s a major part of bittorrent. Each peer shares parts of the files that they’ve partially downloaded already. So Meta didn’t need to finish and share the whole file to have technically shared some parts of copyrighted works. Unless they just had uploading completely disabled,

    The argument was not that it didn’t matter if a user didn’t download the entirety of a work from Meta, but that it didn’t matter whether a user downloaded anything from Meta, regardless of whether Meta was a peer or seed at the time.

    Theoretically, Meta could have disabled uploading but not blocked their client from signaling that they could upload. This would, according to that argument, still counts as reproducing the works, under the logic that signaling that it was available is the same as “making it available.”

    but they still “reproduced” those works by vectorizing them into an LLM. If Gemini can reproduce a copyrighted work “from memory” then that still counts.

    That’s irrelevant to the plaintiff’s argument. And beyond that, it would need to be proven on its own merits. This argument about torrenting wouldn’t be relevant if LLAMA were obviously a derivative creation that wasn’t subject to fair use protections.

    It’s also irrelevant if Gemini can reproduce a work, as Meta did not create Gemini.

    Does any Llama model reproduce the entirety of The Bedwetter by Sarah Silverman if you provide the first paragraph? Does it even get the first chapter? I highly doubt it.

    By the same logic, almost any computer on the internet is guilty of copyright infringement. Proxy servers, VPNs, basically any compute that routed those packets temporarily had (or still has for caches, logs, etc) copies of that protected data.

    There have been lawsuits against both ISPs and VPNs in recent years for being complicit in copyright infringement, but that’s a bit different. Generally speaking, there are laws, like the DMCA, that specifically limit the liability of network providers and network services, so long as they respect things like takedown notices.



  • Further, “Whether another user actually downloaded the content that Meta made available” through torrenting “is irrelevant,” the authors alleged. “Meta ‘reproduced’ the works as soon as it made them available to other peers.”

    Is there existing case law for what making something “available” means? If I say “Alright, I’ll send you this book if you want, just ask,” have I made it available? What if, when someone asks, I don’t actually send them anything?

    I’m thinking outside of contexts of piracy and torrenting, to be clear - like if a software license requires you to make any changed versions available to anyone who uses the software. Can you say it’s available if your distribution platform is configured to prevent downloads?

    If not, then why would it be any different when torrenting?

    Meta ‘reproduced’ the works as soon as it made them available to other peers.

    The argument that a copyrighted work has been reproduced when “made available,” when “made available” has such a low bar is also perplexing. If I post an ad on Craigslist for the sale of the Mona Lisa, have I reproduced it?

    What if it was for a car?

    I’m selling a brand new 2026 Alfa Romeo 4E, DM me your offers. I’ve now “reproduced” a car - come at me, MPAA.





  • It’s okay, the author of the article didn’t actually read (or understand) the Copyright Office’s recommendations. They are:

    Based on an analysis of copyright law and policy, informed by the many thoughtful comments in response to our NOI, the Office makes the following conclusions and recommendations:

    • Questions of copyrightability and AI can be resolved pursuant to existing law, without the need for legislative change.
    • The use of AI tools to assist rather than stand in for human creativity does not affect the availability of copyright protection for the output.
    • Copyright protects the original expression in a work created by a human author, even if the work also includes AI-generated material.
    • Copyright does not extend to purely AI-generated material, or material where there is insufficient human control over the expressive elements.
    • Whether human contributions to AI-generated outputs are sufficient to constitute authorship must be analyzed on a case-by-case basis.
    • Based on the functioning of current generally available technology, prompts do not alone provide sufficient control.
    • Human authors are entitled to copyright in their works of authorship that are perceptible in AI-generated outputs, as well as the creative selection, coordination, or arrangement of material in the outputs, or creative modifications of the outputs.
    • The case has not been made for additional copyright or sui generis protection for AI- generated content.

    Pretty much everything the article’s author stated is contradicted by the above.



  • A paid skillful engineer, who doesn’t think it’s important to make that sort of a change and who knows how the system works, will know that, if success is judged solely by “does it work?” then the effort is doomed for failure. Such an engineer will push to have the requirements written clearly and explicitly - “how does it function?” rather than “what are the results?” - which means that unless the person writing the requirements actually understands the solution, said solution will end up having its requirements written such that even if it’s defeated instantly, it will count as a success. It met the specifications, after all.





  • Wouldn’t be a huge change at this point. Israel has been using AI to determine targets for drone-delivered airstrikes for over a year now.

    https://en.m.wikipedia.org/wiki/AI-assisted_targeting_in_the_Gaza_Strip gives a high level overview of Gospel and Lavender, and there are news articles in the references if you want to learn more.

    This is at least being positioned better than the ways Lavender and Gospel were used, but I have no doubt that it will be used to commit atrocities as well.

    For now, OpenAI’s models may help operators make sense of large amounts of incoming data to support faster human decision-making in high-pressure situations.

    Yep, that was how they justified Gospel and Lavender, too - “a human presses the button” (even though they’re not doing anywhere near enough due diligence).

    But it’s worth pointing out that the type of AI OpenAI is best known for comes from large language models (LLMs)—sometimes called large multimodal models—that are trained on massive datasets of text, images, and audio pulled from many different sources.

    Yes, OpenAI is well known for this, but they’ve also created other types of AI models (e.g., Whisper). I suspect an LLM might be part of a solution they would build but that it would not be the full solution.


  • Both devices have integrated memory, so that 16 GB will look more like a 11/5, 12/4, or maybe even 14/2 split. The Steam Deck is also $400 for an LCD model or $550 for the OLED, not $800. It’s reasonable to expect more performance when you pay more.

    Because the Steam Deck has a lower native resolution, that means that less of the RAM will be used for the integrated GPU. Downscaling from 1080p to 720p doesn’t look good, either - and you could downscale to 540p if supported, but if you need to do that (vs choosing to for an emulated game) it probably won’t be pretty, either.

    This device is also running Windows, rather than a streamlined Linux-based launcher, meaning that more of that RAM will be taken up by OS processes by default.

    The article talks about how the 8840U benefits from more, fast RAM. You won’t get near the 8840U’s full potential gaming with 16 GB. 24 GB, on the other hand, would have been enough that games expecting 16 GB of system RAM would have been able to get it, even while devoting 6-7 GB to the GPU and 1-2 GB to the OS.


  • Unless something has changed, it did. The page linked reads:

    And, obviously, this POC is open source, the code is publish here on our forge.

    The link takes you to their repos. The server repo has instructions on self-hosting directly on your server or with Docker. The app repo has code for both the iOS and Android apps. That’s good, because the iOS app at least doesn’t have a built-in way to select a different backend server.

    Whisper is by OpenAI and as far as I know they have not shared the training code, much less the data sets, so the best you can do is fine-tune the models they’ve provided.

    If use of Whisper is a problem, but the project is otherwise interesting to you, you could ask them to consider using a different STT solution (or allowing the user to choose between different options). I’m not aware of any fully open STT applications that are considered to be as capable as Whisper, but if you do, that would be great info to share with them.