Hello Box developer community,
I hope you are all doing well. I have a query regarding the download of text representations for files, specifically for files that are quite large, for instance, around 5GB in size.
I understand that Box provides the ability to download files using their API, and I’ve been successfully using it for smaller files. However, I’d like to know if there are any limitations or best practices when it comes to downloading the text representation of large files, especially when they are significantly larger, like the 5GB example.
Specifically, I’m interested in knowing:
- Size Limitations: Are there any size limitations when it comes to downloading the text representation of a file? Is there a maximum file size that should be considered when requesting text content?
- API Rate Limit: Does Box have any rate limits or throttling in place for downloading large files? Are there any guidelines on how to handle rate limits, especially for large downloads?
- Optimizations: Are there any recommended techniques or best practices for optimizing the download process for large files to ensure efficiency and reliability?
- Error Handling: How should errors be handled when dealing with large file downloads? Are there specific error codes or mechanisms to be aware of when downloading large text representations?
I would greatly appreciate your insights and guidance on this matter. Any tips, best practices, or technical details would be immensely helpful in ensuring a smooth and efficient download process, especially when working with large files.
Hi @user112, welcome to the forum.
Text representations are only available for files smaller than 500 MB.
You could try downloading the binary and then plugging in some sort of text extractor for the file types you’re interested in. That, of course, happens outside of Box.
For the API rate limits you can take a look here.
As for download optimizations, there is not much to say beyond the following:
- A download file endpoint is available; it streams the file.
- It accepts a Range header, so you can resume a failed download or download the file in sections.
- Instead of starting the download immediately, you can request the download URL for the file.
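To make the Range header point concrete, here is a minimal Python sketch using only the standard library. The helper names (`range_header`, `section_ranges`, `resume_download`) are made up for illustration, and it assumes the redirect target accepts the headers that `urllib` re-sends after following the redirect, so treat it as a starting point rather than a tested client.

```python
import os
import urllib.request

# Endpoint taken from the curl example in this thread.
CONTENT_URL = "https://api.box.com/2.0/files/{file_id}/content"


def range_header(start, end=None):
    """Return an HTTP Range value; `end` is inclusive, None means 'to the end'."""
    if start < 0 or (end is not None and end < start):
        raise ValueError("invalid byte range")
    return f"bytes={start}-" if end is None else f"bytes={start}-{end}"


def section_ranges(total_size, section_size):
    """Yield inclusive (start, end) pairs that cover total_size bytes."""
    for start in range(0, total_size, section_size):
        yield start, min(start + section_size, total_size) - 1


def resume_download(file_id, token, dest, chunk_size=1024 * 1024):
    """Continue a partial download from the bytes already saved in `dest`."""
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    request = urllib.request.Request(
        CONTENT_URL.format(file_id=file_id),
        headers={"Authorization": f"Bearer {token}", "Range": range_header(offset)},
    )
    # urlopen follows the redirect and re-sends these headers (assumed to be accepted).
    with urllib.request.urlopen(request) as response:
        # 206 means the range was honoured; 200 means the whole file is being sent.
        mode = "ab" if response.status == 206 else "wb"
        with open(dest, mode) as out:
            while chunk := response.read(chunk_size):
                out.write(chunk)
```

If `dest` already holds the complete file, the server will most likely answer 416 and `urlopen` raises `urllib.error.HTTPError`, so a caller would want to catch that case.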
There are a few interesting response codes that you can take a look at here. However, there are no download-specific error codes, just the default client error, which contains context about the error.
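Since there are no download-specific error codes, one common approach is to decide on retries from the HTTP status alone. The sketch below is only an assumption about a reasonable policy: the set of retryable statuses, the `retry_delay` name, and the backoff values are illustrative and not taken from the Box documentation. It only understands a `Retry-After` value given in whole seconds.

```python
# Statuses treated as transient here; this choice is an assumption, not Box guidance.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def retry_delay(status, retry_after=None, attempt=0, base=1.0, cap=60.0):
    """Return how many seconds to wait before retrying, or None to give up.

    `retry_after` is the raw Retry-After header value, if the server sent one.
    `attempt` is the zero-based number of retries already made.
    """
    if status not in RETRYABLE_STATUSES:
        return None
    # Prefer the server's hint when it is a plain number of seconds.
    if retry_after is not None and retry_after.strip().isdigit():
        return min(float(retry_after), cap)
    # Otherwise fall back to capped exponential backoff.
    return min(cap, base * 2 ** attempt)
```

For a multi-gigabyte file, combining this with a Range request means a retry only has to fetch the bytes that are still missing.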
Here is a curl example showing how to get the download URL:
❯ curl -i -X GET 'https://api.box.com/2.0/files/1282558511396/content' \
--header 'Authorization: Bearer sW...INJ'
date: Tue, 17 Oct 2023 18:00:02 GMT
cache-control: no-cache, no-store
via: 1.1 google
alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
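A rough Python equivalent of the curl call above, as a sketch only. It assumes the endpoint answers with a 302 redirect whose Location header carries the download URL (the output above does not show that header), and the function names are hypothetical. `http.client` is used because it does not follow redirects, so the URL can be read instead of being fetched right away.

```python
import http.client


def pick_location(status, location):
    """Return the redirect target, or raise if the response is not a usable 302."""
    if status == 302 and location:
        return location
    raise RuntimeError(f"expected a 302 with a Location header, got status {status}")


def get_download_url(file_id, token, host="api.box.com"):
    """Request the file content endpoint and return the URL it redirects to."""
    conn = http.client.HTTPSConnection(host, timeout=30)
    try:
        conn.request(
            "GET",
            f"/2.0/files/{file_id}/content",
            headers={"Authorization": f"Bearer {token}"},
        )
        response = conn.getresponse()
        response.read()  # drain the body so the connection closes cleanly
        return pick_location(response.status, response.getheader("Location"))
    finally:
        conn.close()
```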
If you can let us know a bit more about your use case, we can ask around.
Thank you for the detailed response. It’s much appreciated!
In my use case, I’m working with binary data (bytes) received from Box.com, and I need to convert it to a string. Could you tell me which encoding Box.com uses for the text representation? Knowing the encoding would let me use Python libraries to extract the data more effectively.
It should be UTF-8 encoding.
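If the bytes arrive in chunks, decoding each chunk on its own can break a multi-byte UTF-8 character that straddles two chunks. A minimal sketch using the standard library's incremental decoder is below; `decode_chunks` is an illustrative name, and replacing undecodable bytes instead of raising is an assumption about what you want.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Yield text decoded from an iterable of byte chunks.

    The incremental decoder buffers incomplete multi-byte sequences, so a
    character split across two chunks is still decoded correctly.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush whatever is still buffered once the stream has ended.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

For example, `"".join(decode_chunks(iterable_of_bytes))` gives the full string, while iterating over the generator directly keeps memory use low for large text.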
On a side note…
I’ve been discussing this internally, and the 500 MB limit is in place.
However, folks on my side seem quite open to revisiting this topic, especially from an AI perspective, where we can have big files to process that don’t necessarily yield an equally big text version to send to an LLM.
Having said that, and assuming you are familiar with how product managers work, I would kindly ask you to submit this as an idea on our Box Pulse.
This type of tool is how our PMs track requests and manage the product roadmap. The more requests we have, the better.
Also, if you are or represent a customer, mention which one too.