Time |
Nickname |
Message |
00:00
🔗
|
timmc |
The WARC had the space in it? |
00:00
🔗
|
JAA |
Yes. |
00:00
🔗
|
timmc |
Now I wonder what browsers do when confronted with that. |
00:00
🔗
|
JAA |
They'll handle it fine, probably. |
00:01
🔗
|
JAA |
Browsers are developed to handle all sorts of crap thrown at them by badly written web servers. |
00:02
🔗
|
JAA |
(Unfortunately, because that means that the web servers never get fixed to conform to the standards.) |
00:02
🔗
|
JAA |
But I guess the IA library which handles this stuff is more strict. |
00:06
🔗
|
timmc |
OK, can confirm, Firefox is fine with it. |
00:09
🔗
|
voidsta |
vvvvvvv/13 |
00:09
🔗
|
voidsta |
oops |
00:14
🔗
|
xmc |
hi |
00:14
🔗
|
voidsta |
hello |
00:14
🔗
|
JAA |
Looks like others have experienced this problem of web servers including whitespace in the chunk size before: https://webcache.googleusercontent.com/search?q=cache:https%3A%2F%2Fjava.net%2Fjira%2Fbrowse%2FGRIZZLY%2D1684 (java.net shut down recently :-| ) |
00:15
🔗
|
joepie91 |
JAA: timmc: chunk sizes in the WARCs are padded to multiples of 3 hex chars |
00:15
🔗
|
joepie91 |
using spaces |
00:16
🔗
|
joepie91 |
(in my report, they're represented as dots) |
00:17
🔗
|
JAA |
Yeah, but not always. |
00:17
🔗
|
JAA |
Hmm, or maybe it is always. |
00:19
🔗
|
joepie91 |
from what I could see, it's always |
00:19
🔗
|
joepie91 |
just not all numbers in the source are chunk sizes :) |
00:21
🔗
|
JAA |
Found the Apache bug: https://bz.apache.org/bugzilla/show_bug.cgi?id=41364 |
00:21
🔗
|
JAA |
Although it seems unlikely that they were still using that version last year. :-P |
00:23
🔗
|
JAA |
Someone claims there that "The spaces padding the hex value are ok according to rfc2616" |
00:25
🔗
|
timmc |
joepie91: Yeah, I tested it with the added spaces and a fake HTTP server. |
00:26
🔗
|
JAA |
I'm glad that there are standards, but I hate the standards. They're so hard to read at times. |
00:26
🔗
|
timmc |
ugh implied LWS |
00:26
🔗
|
JAA |
Yes, but... |
00:27
🔗
|
JAA |
RFC 2616 was obsoleted by 7230. 7230 uses ABNF from RFC 5234. |
00:27
🔗
|
JAA |
5234 specifically states that "This specification for ABNF does not provide for implicit specification of linear white space." and "Any grammar that wishes to permit linear white space around delimiters or string segments must specify it explicitly." |
00:28
🔗
|
JAA |
And I can't find that in 7230 anywhere. |
00:29
🔗
|
JAA |
Actually, it states explicitly: "Rules about implicit linear whitespace between certain grammar productions have been removed; now whitespace is only allowed where specifically defined in the ABNF." |
00:29
🔗
|
JRWR |
man the scaleway API is strange |
00:30
🔗
|
|
Ravenloft has quit IRC (Read error: Operation timed out) |
00:45
🔗
|
|
VADemon has quit IRC (Read error: Connection reset by peer) |
00:45
🔗
|
joepie91 |
timmc: LWS? |
00:45
🔗
|
JAA |
Linear white space |
00:46
🔗
|
JAA |
Meaning white space (0x20) and horizontal tabs (0x09). |
00:48
🔗
|
timmc |
JAA: Too little too late, I suppose. |
00:48
🔗
|
JAA |
So here's what I think is happening on those pages: Apache, portalgraphics's web server, for some reason pads the chunk sizes with spaces to multiples of three. Since almost all clients probably support RFC 2616 (backwards compatibility etc.), this isn't actually a problem, although it isn't exactly conformant with the most up-to-date standards. (portalgraphics may have been using an old version of Apache, though, from before the release of RFC 7230.) |
00:49
🔗
|
JAA |
However, the IA library handling HTTP responses uses RFC 7230 and therefore doesn't allow whitespace after the chunk size. It fails to decode it and handles it as raw data instead, in effect simply dropping the "Transfer-Encoding: chunked" header. |
00:50
🔗
|
JAA |
Which then leads to "garbage" showing up in the final response from the IA. |
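
The difference between a lenient and a strict reading of the chunk size line can be sketched as follows. This is a minimal illustration only, not the IA library's code: the function name `decode_chunked` and its `lenient` flag are hypothetical, and trailers are ignored.

```python
def decode_chunked(body: bytes, lenient: bool = True) -> bytes:
    """Decode an HTTP/1.1 chunked body (illustrative sketch).

    lenient=True tolerates spaces/tabs around the hex chunk size (the padding
    discussed above); lenient=False rejects them, roughly like a strict parser.
    """
    hex_digits = b"0123456789abcdefABCDEF"
    out = bytearray()
    pos = 0
    while True:
        eol = body.find(b"\r\n", pos)
        if eol == -1:
            raise ValueError("missing CRLF after chunk size")
        size_field = body[pos:eol].split(b";", 1)[0]  # drop chunk extensions
        if lenient:
            size_field = size_field.strip(b" \t")
        if not size_field or any(c not in hex_digits for c in size_field):
            raise ValueError(f"invalid chunk size: {size_field!r}")
        size = int(size_field.decode("ascii"), 16)
        pos = eol + 2
        if size == 0:
            return bytes(out)  # trailers, if any, are ignored in this sketch
        chunk = body[pos:pos + size]
        if len(chunk) != size or body[pos + size:pos + size + 2] != b"\r\n":
            raise ValueError("truncated or malformed chunk")
        out += chunk
        pos += size + 2


# Chunk sizes padded with spaces to three characters, as described above.
padded = b"5  \r\nhello\r\n6  \r\n world\r\n0\r\n\r\n"
decoded = decode_chunked(padded)
```

With `lenient=False` the same input raises `ValueError`; a caller that then falls back to treating the body as raw data would show the chunk sizes as "garbage" in the page.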
00:53
🔗
|
JRWR |
Anyone want my scaleway script |
00:53
🔗
|
JRWR |
it deploys 7 grab scripts at a time |
00:54
🔗
|
JRWR |
using the arm64 instances |
00:54
🔗
|
JRWR |
(2.99Euro) |
00:54
🔗
|
voidsta |
sure, share it :) |
00:54
🔗
|
JAA |
By the way, if you request the id_ resource for that link I gave above, the IA sends it in chunked transfer encoding again; the raw traffic back from IA to the browser then looks like double-chunk-encoded: https://gist.githubusercontent.com/anonymous/accf1455050dcf01f19a3b6d1f7cf658/raw/89f5ab19945c49e3770bb6571e36b9f2ae8f1594/gistfile1.txt |
00:55
🔗
|
JRWR |
and voidsta a script to clean up all the servers as well |
00:55
🔗
|
voidsta |
JRWR: cool :) |
00:55
🔗
|
JAA |
JRWR: I'd love to see how you automated it, although I won't be using it directly. I've been meaning to look into how to make the whole process of joining a new project a bit easier across multiple machines. |
00:55
🔗
|
JRWR |
ya |
00:56
🔗
|
JRWR |
its simple as fuck really |
00:56
🔗
|
voidsta |
same |
01:00
🔗
|
JRWR |
https://gist.github.com/JRWR/4b1cdbe0f55f00d92c10ff1e2355c5b7 |
01:00
🔗
|
JRWR |
there you go |
01:00
🔗
|
JRWR |
thats both scripts |
01:01
🔗
|
JRWR |
updated to show my script.sh it sent to the servers |
01:01
🔗
|
|
ajft has joined #archiveteam-bs |
01:01
🔗
|
JRWR |
mostly its the default one with a screen -dm on the run-pipeline |
01:02
🔗
|
voidsta |
cool, thanks for sharing |
01:03
🔗
|
JRWR |
its very shotgun style |
01:03
🔗
|
JRWR |
but it gets the job done |
01:05
🔗
|
|
ndiddy has quit IRC () |
01:05
🔗
|
JAA |
Thanks, I'll have a look at it tomorrow. |
01:06
🔗
|
JAA |
"Some twats drove a van into pedestrians and stabbed people. But don't despair, this will never happen again once we start regulating the internet." FFS, Theresa... |
01:12
🔗
|
|
j08nY has quit IRC (Quit: Leaving) |
01:14
🔗
|
xmc |
seems reasonable |
01:14
🔗
|
xmc |
worked here in the usa |
01:19
🔗
|
joepie91 |
JAA: interestingly, the chunk cutoffs happened in specific places a lot. I wonder whether you can infer where variables were used (in string concatenation, in PHP) from the chunk cutoff points |
01:20
🔗
|
joepie91 |
JAA: also, if that theory is correct, it'd be relatively simple to fix all the WARCs |
01:21
🔗
|
xmc |
if modifying warcs is acceptable. |
01:22
🔗
|
xmc |
tbh i'm on the fence about that |
01:22
🔗
|
xmc |
even in this case |
01:22
🔗
|
JAA |
I guess it might be possible to infer something about internal buffer sizes etc. |
01:23
🔗
|
JAA |
Agreed. The WARCs contain an accurate representation of what the web server delivered to clients. The fact that some clients or libraries can't handle it is secondary to preserving the original data in my opinion. |
01:23
🔗
|
|
ndiddy has joined #archiveteam-bs |
01:25
🔗
|
JAA |
We definitely need to get in touch with the IA though so they can fix their software if my assumptions above are correct. |
01:25
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
01:25
🔗
|
JAA |
I'm sure they have tons of other pages with the same "bug". |
01:27
🔗
|
xmc |
quite possibly! |
01:36
🔗
|
|
schbirid2 has joined #archiveteam-bs |
01:39
🔗
|
|
schbirid has quit IRC (Read error: Operation timed out) |
01:42
🔗
|
joepie91 |
okay, that was poorly worded |
01:42
🔗
|
joepie91 |
it'd be relatively simple to fix the wayback output* |
01:42
🔗
|
joepie91 |
:p |
01:42
🔗
|
joepie91 |
cc JAA xmc |
01:42
🔗
|
joepie91 |
definitely not advocating for [irrevocably] modifying source data |
01:42
🔗
|
xmc |
ah yep |
01:43
🔗
|
joepie91 |
but eg. storing a 'fixed' copy of the WARC can be desirable for perf purposes (over fixing stuff on-the-fly in the wayback) |
01:43
🔗
|
joepie91 |
without touching the original |
01:53
🔗
|
|
icedice has quit IRC (Ping timeout: 250 seconds) |
01:54
🔗
|
* |
JRWR spins up 100 instances |
01:54
🔗
|
JRWR |
oops |
02:03
🔗
|
voidsta |
:) |
02:04
🔗
|
|
pizzaiolo has quit IRC (Ping timeout: 260 seconds) |
02:26
🔗
|
|
JRWR has quit IRC (Quit: Page closed) |
03:15
🔗
|
|
superkuh has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye) |
03:24
🔗
|
|
Aranje has quit IRC (Quit: Three sheets to the wind) |
03:34
🔗
|
|
Sk1d has joined #archiveteam-bs |
03:44
🔗
|
|
ndiddy has quit IRC () |
03:51
🔗
|
|
ajft has left |
04:08
🔗
|
|
slyphic has quit IRC (Read error: Operation timed out) |
04:08
🔗
|
|
slyphic has joined #archiveteam-bs |
04:13
🔗
|
MrRadar |
Over in #outofsteam DoomTay noticed that SPUF was returning data with chunked encoding when he used wpull to grab their front page |
04:14
🔗
|
MrRadar |
I checked on my end and found out that they did that for both wpull and wget-lua if a custom user-agent is not specified |
04:14
🔗
|
MrRadar |
With the ArchiveTeam user-agent SPUF returns data without chunking |
04:15
🔗
|
MrRadar |
I've also verified that both wpull and wget-lua are producing WARCs with the same corruption as portalgraphics when SPUF returns data with chunked transfers |
04:24
🔗
|
MrRadar |
We've also figured out that using a browser User-agent still results in chunked transfers, but adding an Accept header like an actual browser would will cause it to switch back to non-chunked transfers |
04:29
🔗
|
MrRadar |
Can someone with access to the tracker please stop Pixiv? |
04:29
🔗
|
MrRadar |
The chunked transfer issue is 100% affecting grabs from them |
04:29
🔗
|
MrRadar |
xmc, arkiver, SketchCow: ^^^ |
04:37
🔗
|
MrRadar |
I'm grabbing the latest unpatched wget to see if that has the same issue as wpull and wget-lua |
04:46
🔗
|
MrRadar |
This issue affects the official wget 1.19 release |
04:52
🔗
|
MrRadar |
OK, I've tracked down the bug in the wget source code. |
04:53
🔗
|
MrRadar |
The way WARC writing works in wget is there are two output files passed to the fd_read_body() function |
04:53
🔗
|
MrRadar |
The first gets only the main content, the second gets both the content and headers |
04:53
🔗
|
MrRadar |
WARC output stores the data from the second stream into the WARC. |
04:54
🔗
|
MrRadar |
However, as the comment on the function says: "If OUT2 is non-NULL, the contents is also written to OUT2. OUT2 will get an exact copy of the response: if this is a chunked response, everything -- including the chunk headers -- is written to OUT2. (OUT will only get the unchunked response.)" |
04:54
🔗
|
MrRadar |
So it's a deliberate design decision to dump the chunked transfer size as part of the WARC output |
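
The two-output behaviour described in that comment can be sketched in Python. This is not wget's C code: `read_chunked_body` and its parameters are hypothetical stand-ins for `fd_read_body()`, OUT and OUT2, and the sketch assumes a well-formed body with no trailers.

```python
import io


def read_chunked_body(src, out, out2=None):
    """Copy a chunked HTTP body from src.

    out receives only the unchunked payload; out2 (if given) receives an exact
    copy of the bytes read, chunk headers included.
    """
    def tee(data: bytes) -> bytes:
        if out2 is not None:
            out2.write(data)
        return data

    while True:
        size_line = tee(src.readline())            # e.g. b"1a\r\n"
        size = int(size_line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            tee(src.readline())                    # final CRLF, no trailers assumed
            return
        out.write(tee(src.read(size)))             # chunk data goes to both outputs
        tee(src.readline())                        # CRLF that terminates the chunk


wire = b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
out, out2 = io.BytesIO(), io.BytesIO()
read_chunked_body(io.BytesIO(wire), out, out2)
```

Here `out` ends up holding only the payload, while `out2` holds the bytes exactly as they appeared on the wire, which is the stream that goes into the WARC.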
04:55
🔗
|
voidsta |
so, not a bug? |
04:56
🔗
|
MrRadar |
Actually, looks like it is a bug |
04:56
🔗
|
voidsta |
hm |
04:56
🔗
|
MrRadar |
According to the "WARC_ISO_28500_final_draft v018" document I found: "The payload of a 'response' record with a target-URI of scheme 'http' or 'https' is defined as its 'entity-body' (per [RFC2616]), with any transfer-encoding removed. If a truncated 'response' record block contains less than the full entity-body, the payload is considered truncated at the same position." |
04:56
🔗
|
MrRadar |
The "with any transfer-encoding removed" bit indicates that this is non-compliant behavior on the part of wget |
04:57
🔗
|
MrRadar |
As the chunked-transfer header would count as part of the transfer-encoding |
04:57
🔗
|
MrRadar |
Unless they mean something completely different than the HTTP spec when they are referring to "transfer-encoding" |
04:58
🔗
|
MrRadar |
(The document can be found here: http://archive-access.sourceforge.net/warc/WARC_ISO_28500_final_draft%20v018%20Zentveld%20080618.doc) |
05:00
🔗
|
MrRadar |
Well, I need to get to bed. See you in the morning |
05:00
🔗
|
* |
MrRadar is AFK |
05:01
🔗
|
|
Sk1d has quit IRC (Ping timeout: 194 seconds) |
05:06
🔗
|
|
ItsYoda has joined #archiveteam-bs |
05:07
🔗
|
|
Sk1d has joined #archiveteam-bs |
05:15
🔗
|
godane |
so i have only uploaded 37k items this year so far |
05:31
🔗
|
|
JRWR has joined #archiveteam-bs |
05:31
🔗
|
JRWR |
something is going down |
05:31
🔗
|
JRWR |
MrRadar: do you confirm? |
05:32
🔗
|
MrRadar |
I'm not 100% sure |
05:32
🔗
|
JRWR |
ill point my webserver to the ingress folder if you want to start checking |
05:32
🔗
|
MrRadar |
wget is saving the chunked transfer headers into the WARCs but I'm pretty sure that's against the WARC spec |
05:32
🔗
|
MrRadar |
But I'm definitely not an expert on the WARC spec |
05:33
🔗
|
MrRadar |
We'd probably need to hear for sure from somebody at the IA who knows the spec very well |
05:34
🔗
|
MrRadar |
I did confirm the WARCs I was uploading for Pixiv contained the hex garbage |
05:35
🔗
|
JRWR |
Shit |
05:35
🔗
|
JRWR |
I want confirmation from an OP |
05:36
🔗
|
JRWR |
but I will keep the ingress in case of something crazy happening |
05:37
🔗
|
JRWR |
MrRadar: http://spacescience.tech/warc/incoming-uploads/ |
05:37
🔗
|
JRWR |
you can start checking if you want |
05:38
🔗
|
MrRadar |
Picking at random, this file has chunked transfer headers in the roomtop.php response body: http://spacescience.tech/warc/incoming-uploads/JRWR/pixiv-roomtop_100594-20170605-044020.warc.gz |
05:39
🔗
|
JRWR |
ya I see them too |
05:39
🔗
|
JRWR |
Interesting |
05:40
🔗
|
JRWR |
im looking to see if there are any issues with the dumps |
05:42
🔗
|
JRWR |
Yep found some |
05:42
🔗
|
JRWR |
FUCK |
05:42
🔗
|
JRWR |
http://spacescience.tech/warc/incoming-uploads/Abel_LF/pixiv-roomtop_618874-20170604-145848.warc.gz |
05:42
🔗
|
JRWR |
Line 426 |
05:43
🔗
|
JRWR |
Shit |
05:43
🔗
|
JRWR |
there is some in the AMFs |
05:43
🔗
|
JRWR |
fffffffffffffff |
05:44
🔗
|
JRWR |
I extracted all the static files |
05:44
🔗
|
JRWR |
out of the 20, only 2 matched their SHA1s |
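
One plausible reason for such mismatches, assuming the files were extracted from the records without undoing the chunked transfer encoding, is that the chunk headers get hashed along with the payload. A tiny illustration with made-up bytes (the names `payload`, `wire` and `matches` are hypothetical):

```python
import hashlib

payload = b"hello world"                    # what the server meant to send
wire = b"b\r\nhello world\r\n0\r\n\r\n"     # the same payload, chunk-encoded (0xb = 11 bytes)


def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def matches(expected: bytes, extracted: bytes) -> bool:
    """True if the extracted bytes hash to the same SHA-1 as the expected file."""
    return sha1_hex(expected) == sha1_hex(extracted)


raw_matches = matches(payload, wire)            # chunk headers still included
stripped_matches = matches(payload, wire[3:14])  # only the 11 payload bytes
```

Files served without chunked encoding would hash correctly either way, which would fit only some of the files matching.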
05:44
🔗
|
JRWR |
These are bad dumps |
05:45
🔗
|
JRWR |
Who do we ping MrRadar |
05:45
🔗
|
JRWR |
http://imgur.com/6NfmQ |
05:46
🔗
|
MrRadar |
I already tried pinging everyone who has tracker access, but none of them are online at the moment |
05:46
🔗
|
MrRadar |
In the mean time you could reduce your rsync to 1 connection max |
05:47
🔗
|
MrRadar |
Or just turn it off altogether |
05:49
🔗
|
JRWR |
rsync is OFFLINE |
05:49
🔗
|
MrRadar |
JRWR: Which AMF files are you seeing with this issue? In the one you linked none of the AMF files were transferred with chunked encoding |
05:51
🔗
|
JRWR |
my bad it was the PNGs |
05:51
🔗
|
MrRadar |
OK, yeah some of those are definitely affected |
05:53
🔗
|
JRWR |
felt a great disturbance in the Force, as if millions of voices suddenly cried out in terror and were suddenly silenced. I fear something terrible has happened. |
05:54
🔗
|
JRWR |
ok |
05:54
🔗
|
JRWR |
We got to fix this in the meantime |
05:55
🔗
|
JRWR |
its def Wget-lua doing this |
05:56
🔗
|
MrRadar |
Yeah, I actually tracked down the cause of the bug while you weren't online |
05:56
🔗
|
MrRadar |
Inside wget |
05:56
🔗
|
JRWR |
Good |
05:56
🔗
|
JRWR |
a simple fix is to disable http1.1 |
05:56
🔗
|
JRWR |
and ask for HTTP/1.0 |
05:56
🔗
|
JRWR |
but that does disable keepalives |
05:57
🔗
|
JRWR |
wait, how many dumps have been going on over the years with this issue? |
05:57
🔗
|
JRWR |
I wonder if anyone ever checked |
05:58
🔗
|
MrRadar |
While I haven't verified with the git history, this looks like it's been a problem since WARC support was first added to wget |
05:58
🔗
|
godane |
so i got a 256gb usb stick Saturday |
05:58
🔗
|
godane |
for $45 |
05:59
🔗
|
JRWR |
so.. |
05:59
🔗
|
JRWR |
thats ALL the dumps? |
06:00
🔗
|
MrRadar |
Any ones that have data transferred with the chunked transfer-encoding |
06:00
🔗
|
MrRadar |
Assuming my interpretation of the WARC spec is correct |
06:00
🔗
|
JRWR |
hrm |
06:00
🔗
|
MrRadar |
Given how extensive the issue is, it may be easier to just update the WARC spec to allow chunked transfer headers inside WARC response records |
06:00
🔗
|
JRWR |
true |
06:02
🔗
|
JRWR |
so the hex we are seeing are the headers for the next chunk? |
06:02
🔗
|
godane |
so the wget WARC code was screwing things up? |
06:02
🔗
|
MrRadar |
If I'm right, yes |
06:02
🔗
|
MrRadar |
But I'm not sure I am |
06:03
🔗
|
MrRadar |
When data is transferred with the HTTP "chunked" transfer-encoding, wget is writing the chunk headers into the WARC |
06:03
🔗
|
godane |
but wouldn't that cause the last few years of archiving to have problems |
06:03
🔗
|
pikhq |
Unless everyone's been misreading the spec the same way. |
06:04
🔗
|
MrRadar |
The WARC spec says "The payload of a 'response' record with a target-URI of scheme 'http' or 'https' is defined as its 'entity-body' (per [RFC2616]), with any transfer-encoding removed." |
06:04
🔗
|
godane |
but whatever this bug is its not with everything |
06:05
🔗
|
MrRadar |
Yes, only when the web server uses "Transfer-encoding: chunked" |
06:05
🔗
|
pikhq |
Not a *lot* of things used chunked encoding. |
06:05
🔗
|
ranma |
is there a "best way" to back up a reddit post |
06:05
🔗
|
godane |
ok |
06:05
🔗
|
ranma |
with a lot of collapsed comment threads? |
06:05
🔗
|
MrRadar |
Locally or with e.g. archivebot? |
06:05
🔗
|
ranma |
for archive bat |
06:05
🔗
|
ranma |
bot |
06:06
🔗
|
ranma |
lol |
06:06
🔗
|
ranma |
e.g. https://www.reddit.com/r/apple/comments/6ezhwm/iama_foxconn_insider_with_information_on_next_12/dienjss/?context=3 |
06:06
🔗
|
ranma |
er |
06:06
🔗
|
ranma |
https://www.reddit.com/r/apple/comments/6ezhwm/iama_foxconn_insider_with_information_on_next_12 |
06:06
🔗
|
godane |
that at least means we shouldn't have a lot of corrupt data |
06:06
🔗
|
MrRadar |
!a https://www.reddit.com/r/apple/comments/6ezhwm/iama_foxconn_insider_with_information_on_next_12/ without an ignore set should do the trick I think? |
06:06
🔗
|
MrRadar |
(Make sure to have the trailing slash) |
06:07
🔗
|
pikhq |
godane: It also implies it should be possible to find all of the data corrupted by this bug. |
06:07
🔗
|
pikhq |
Though the act of finding all of it is definitely a big one just because of how much data there is to sift through... |
06:10
🔗
|
ranma |
MrRadar: isn't that going to hit all the linked sites |
06:10
🔗
|
ranma |
and then maybe a stupid number of other sites? |
06:10
🔗
|
ranma |
!a scares me |
06:10
🔗
|
MrRadar |
!a only recurses into URLs with the same prefix |
06:11
🔗
|
MrRadar |
URLs with a different prefix will be visited but not recursively |
06:11
🔗
|
|
Igloo has quit IRC (Read error: Operation timed out) |
06:11
🔗
|
MrRadar |
That's why the trailing slash would be so important, to limit the scope of the recursion |
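
The prefix rule as described here can be modelled with a plain string comparison. This is a simplified sketch of the behaviour stated in the chat, not ArchiveBot's actual implementation; `recurses_into` and the `_months` sibling URL are made up for illustration.

```python
def recurses_into(url: str, seed: str) -> bool:
    """Simplified model: recursion continues only into URLs that start with the seed."""
    return url.startswith(seed)


base = "https://www.reddit.com/r/apple/comments/6ezhwm/iama_foxconn_insider_with_information_on_next_12"
comment = base + "/dienjss/"
sibling = base + "_months/"   # hypothetical URL that merely shares the prefix

with_slash = (recurses_into(comment, base + "/"), recurses_into(sibling, base + "/"))
without_slash = (recurses_into(comment, base), recurses_into(sibling, base))
```

With the trailing slash only the comment page is in scope; without it, the unrelated sibling URL also matches the prefix.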
06:11
🔗
|
ranma |
i've seen !a of example.com start to crawl marthastewart.com |
06:11
🔗
|
ranma |
hm |
06:12
🔗
|
ranma |
not sure if i used trailing slash |
06:19
🔗
|
SketchCow |
What's the upshot of the bug |
06:20
🔗
|
MrRadar |
SketchCow: When web servers return data with "Transfer-encoding: chunked" wget is saving information into the WARC that (I think?) the spec says should be stripped |
06:20
🔗
|
MrRadar |
Specifically, the size of each data chunk |
06:21
🔗
|
pikhq |
Everything sent from servers using chunked transfer encoding will have spurious hex digits and \r\n sequences in the data that were on the wire, but apparently WARC says aren't supposed to be there. |
06:21
🔗
|
pikhq |
(that is, in the file itself) |
06:21
🔗
|
MrRadar |
You should ask someone at the IA who is familiar with the WARC format about what the right way to handle chunked transfers is |
06:21
🔗
|
MrRadar |
It's possible I'm just reading the spec wrong and wget is doing it right |
06:22
🔗
|
pikhq |
https://github.com/iipc/warc-specifications/issues/22 |
06:22
🔗
|
pikhq |
That seems to imply you're reading the spec wrong. |
06:24
🔗
|
MrRadar |
pikhq: Reading that discussion I think you're right |
06:25
🔗
|
pikhq |
At the least, it's clear the *intention* is wget's behavior. |
06:25
🔗
|
MrRadar |
Yes, they're very deliberately including the headers in the WARC |
06:26
🔗
|
pikhq |
So, if you want to process WARC stuff (for rendering or what have you) you should probably be careful to take into account the transfer encoding, or else you'll get the spurious hex digits and such. |
06:26
🔗
|
pikhq |
But if you're generating a WARC, that's supposed to be there. |
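
For the processing side, one way to take the transfer encoding into account is to let Python's standard `http.client` parse the stored HTTP response, since it undoes chunked encoding on read. This is a minimal sketch: `block` is assumed to be the raw content of a WARC `response` record (extracting it from the WARC file is not shown), and `_FakeSocket` and `payload_from_http_block` are hypothetical helper names.

```python
import http.client
import io


class _FakeSocket:
    """Just enough of a socket for http.client.HTTPResponse to read from bytes."""

    def __init__(self, data: bytes):
        self._file = io.BytesIO(data)

    def makefile(self, *args, **kwargs):
        return self._file


def payload_from_http_block(block: bytes) -> bytes:
    """Parse a stored HTTP response and return its body with chunking removed."""
    resp = http.client.HTTPResponse(_FakeSocket(block))
    resp.begin()        # parses the status line and headers
    return resp.read()  # decodes "Transfer-Encoding: chunked" if present


block = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
    b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
)
body = payload_from_http_block(block)
```

A renderer that skips this step and serves the record block as-is would show the hex chunk sizes inside the page.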
06:26
🔗
|
MrRadar |
That makes sense |
06:27
🔗
|
MrRadar |
Sorry for the false alarm everyone |
06:27
🔗
|
MrRadar |
JRWR: If you're still around, please restart your rsync target |
06:27
🔗
|
pikhq |
No worries. The standard text is genuinely confusing, and your interpretation is a valid one. |
06:27
🔗
|
JRWR |
Been done already |
06:27
🔗
|
pikhq |
(at least, if you're not reading the exact same way they are) |
06:31
🔗
|
|
JRWR_ has joined #archiveteam-bs |
06:34
🔗
|
|
JRWR has quit IRC (Ping timeout: 268 seconds) |
06:36
🔗
|
|
JRWR_ is now known as JRWR |
06:36
🔗
|
JRWR |
So overall that means IA's Wayback Machine doesn't follow the spec as well then |
06:37
🔗
|
MrRadar |
I think the issue with portalgraphics was they were sending slightly malformed chunked encoding headers |
06:37
🔗
|
MrRadar |
With extra padding? |
06:37
🔗
|
MrRadar |
That the IA didn't handle but browsers did |
06:37
🔗
|
MrRadar |
If my review of the logs is correct |
06:37
🔗
|
MrRadar |
*chat logs |
06:50
🔗
|
JRWR |
SketchCow: Looks like we got blacklisted at pixiv |
06:51
🔗
|
MrRadar |
arkiver: ^^^ |
06:52
🔗
|
MrRadar |
It's not by IP since I can view URLs that fail through wget-lua just fine in my browser |
06:55
🔗
|
MrRadar |
Pixiv appears to be running again |
06:55
🔗
|
JRWR |
Ya |
06:55
🔗
|
JRWR |
It looks like we got funneled |
07:10
🔗
|
|
Whopper_ has joined #archiveteam-bs |
07:13
🔗
|
|
Whopper has quit IRC (Ping timeout: 268 seconds) |
08:00
🔗
|
|
SHODAN_UI has joined #archiveteam-bs |
08:44
🔗
|
|
Nazca_ has joined #archiveteam-bs |
08:45
🔗
|
Nazca_ |
funneled is good or bad? |
08:45
🔗
|
|
Nazca has quit IRC (Read error: Operation timed out) |
08:45
🔗
|
|
Nazca_ is now known as Nazca |
08:55
🔗
|
|
Igloo has joined #archiveteam-bs |
09:17
🔗
|
godane |
Donald Trump on Charlie Rose: https://archive.org/details/Charlie-Rose-1992-11-06 |
09:24
🔗
|
|
kristian_ has joined #archiveteam-bs |
09:25
🔗
|
|
jtn2 has joined #archiveteam-bs |
09:29
🔗
|
|
jtn2 has quit IRC (Read error: Operation timed out) |
09:31
🔗
|
|
jtn2 has joined #archiveteam-bs |
09:33
🔗
|
|
SHODAN_UI has quit IRC (Remote host closed the connection) |
09:35
🔗
|
godane |
i'm close to half way point of uploads from last month |
09:36
🔗
|
godane |
i only uploaded 955 items last month |
09:36
🔗
|
godane |
i was grabbing the Mister Rogers stream and ripping tape this past month |
09:40
🔗
|
|
jtn2 has quit IRC (Read error: Operation timed out) |
09:42
🔗
|
|
jtn2 has joined #archiveteam-bs |
10:07
🔗
|
|
jtn2 has quit IRC (Read error: Operation timed out) |
10:12
🔗
|
JAA |
06-05 06:37:06 < MrRadar> I think the issue with portalgraphics was they were sending slightly malformed chunked encoding headers -- Yes, that's how I understand it. Interesting that the WARC should have transfer encoding stripped. I guess it makes sense in a way though. |
10:18
🔗
|
|
jtn2 has joined #archiveteam-bs |
10:23
🔗
|
JAA |
But all in all, I don't think we need to stop current projects or anything like that. It wouldn't be hard to fix WARCs retroactively at some point if we want to do that. |
10:25
🔗
|
JAA |
joepie91: Fixing it in the Wayback Machine should be easy. IA's library for handling HTTP responses in WARC files already deals with chunked encoding, just not with this "malformed" variant. No need to update WARCs or anything; instead, the library should be modified to handle the whitespace padding. |
10:27
🔗
|
|
j08nY has joined #archiveteam-bs |
10:27
🔗
|
Sanqui |
JAA: can you make some sort of writeup so this information doesn't get lost if somebody doesn't get to it right away? |
10:29
🔗
|
JAA |
Sanqui: Yeah, sure. |
10:42
🔗
|
|
jtn2 has quit IRC (Ping timeout: 250 seconds) |
10:43
🔗
|
|
jtn2 has joined #archiveteam-bs |
10:43
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
10:44
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
11:29
🔗
|
|
BlueMaxim has quit IRC (Ping timeout: 600 seconds) |
11:30
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
11:55
🔗
|
|
SHODAN_UI has joined #archiveteam-bs |
12:07
🔗
|
|
kristian_ has quit IRC (Quit: Leaving) |
12:43
🔗
|
|
tfgbd_znc has quit IRC (Ping timeout: 600 seconds) |
12:52
🔗
|
JAA |
Anyone want to archive this? ;-) https://www.bleepingcomputer.com/news/security/hadoop-servers-expose-over-5-petabytes-of-data/ |
12:53
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
13:49
🔗
|
|
superkuh has joined #archiveteam-bs |
13:57
🔗
|
joepie91 |
"To put things in perspective, HDFS servers leak 200 times more data compared to MongoDB servers, which are ten times more prevalent." |
13:57
🔗
|
joepie91 |
~big data~ |
13:58
🔗
|
joepie91 |
JAA: hmm. the WARC stores the original chunked data in the WARC? |
13:58
🔗
|
joepie91 |
ie. the stream of bytes as it appeared over the wire |
13:58
🔗
|
joepie91 |
(as opposed to it being turned into just the content) |
13:58
🔗
|
JRWR |
I do find that strange for a format like WARC |
14:02
🔗
|
joepie91 |
MrRadar: JRWR: please make sure to confirm intended WARC behaviour with somebody who has access to the *final* WARC spec, to ensure that nothing was changed from the draft |
14:03
🔗
|
JRWR |
We did |
14:03
🔗
|
joepie91 |
JRWR: does something still need to be disabled on the tracker? |
14:03
🔗
|
* |
joepie91 has tracker access |
14:03
🔗
|
joepie91 |
(I'm still reading backlog) |
14:03
🔗
|
JRWR |
there is an issue open on the WARC spec GitHub that explains the issue |
14:03
🔗
|
JRWR |
and currently wget is correct in its saving |
14:04
🔗
|
JRWR |
right now we are being throttled HARD by pixiv |
14:05
🔗
|
JRWR |
442053 done + 94722 out + 463227 to do |
14:05
🔗
|
Kalroth |
they hit the anti-DDoS panic button |
14:07
🔗
|
joepie91 |
JRWR: right, if something needs to be changed on the tracker and nobody is around, ping me :P |
14:07
🔗
|
joepie91 |
(pinging me on Freenode results in faster responses) |
14:07
🔗
|
JRWR |
Ah |
14:07
🔗
|
JRWR |
Its OK for now, kind of wish pixiv had not throttled us |
14:08
🔗
|
joepie91 |
I'm going to be pretty busy today though, so preferably include a very precise request of what needs changing so that it's just a few clicks for me and doesn't require extra thinking :P |
14:11
🔗
|
JRWR |
of course joepie91 |
14:12
🔗
|
JRWR |
The only warning I've got on my dash right now is my storage is now half full |
14:14
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
14:15
🔗
|
|
icedice has joined #archiveteam-bs |
14:22
🔗
|
MrRadar |
joepie91: Yeah, after reading the spec issue on GitHub, I see I was initially reading the spec wrong and wget is doing the right thing |
14:22
🔗
|
MrRadar |
I was confused about what the portalgraphics issue was earlier |
14:23
🔗
|
MrRadar |
I missed that the issue was the *extra whitespace* in their chunked transfer headers |
14:23
🔗
|
MrRadar |
Not the headers themselves |
15:08
🔗
|
|
SHODAN_UI has quit IRC (Quit: zzz) |
15:08
🔗
|
|
SHODAN_UI has joined #archiveteam-bs |
15:27
🔗
|
JAA |
joepie91: Yes, as far as I can tell, wget and wpull store the raw data stream in the WARCs. In a way, that's exactly what I'd expect, although I can also see some arguments for stripping transfer encoding first. |
15:28
🔗
|
JAA |
On a related note, I find it interesting that TLS certificates aren't stored in WARCs. |
15:39
🔗
|
joepie91 |
JAA: that might just be a wget thing? I know that Heritrix stores a lot more stuff in WARCs than wget does, even down to DNS requests and responses |
15:40
🔗
|
JAA |
Oh yeah, DNS as well. |
15:40
🔗
|
JAA |
That's very well possible. |
15:44
🔗
|
JAA |
joepie91: Do you have an example Heritrix WARC? I'd like to know how they store those things. |
15:46
🔗
|
joepie91 |
JAA: I don't, unfortunately. somebody in here has made some in the past |
15:46
🔗
|
joepie91 |
but that was a few years ago :) |
15:56
🔗
|
|
icedice has quit IRC (Ping timeout: 245 seconds) |
16:28
🔗
|
|
icedice has joined #archiveteam-bs |
17:11
🔗
|
|
JRWR has quit IRC (Ping timeout: 268 seconds) |
17:16
🔗
|
|
ZexaronS has joined #archiveteam-bs |
17:50
🔗
|
|
dashcloud has quit IRC (Ping timeout: 260 seconds) |
17:54
🔗
|
|
fie has quit IRC (Read error: Operation timed out) |
18:31
🔗
|
|
za3k has joined #archiveteam-bs |
18:31
🔗
|
za3k |
#internetarchive |
18:32
🔗
|
za3k |
i'm an idiot, ignore |
18:33
🔗
|
|
Rai-chan has quit IRC (Ping timeout: 268 seconds) |
18:33
🔗
|
za3k |
What I meant to say is: https://za3k.com/github/ is back up and actively archiving the summary metadata of github projects (mostly names and ids) |
18:33
🔗
|
za3k |
ghtorrent.org is pretty much strictly better, does anyone already have a copy? |
18:34
🔗
|
|
Jon has quit IRC (Ping timeout: 268 seconds) |
18:35
🔗
|
|
Jon has joined #archiveteam-bs |
18:37
🔗
|
|
Aoede has quit IRC (Ping timeout: 268 seconds) |
18:37
🔗
|
|
purplebot has quit IRC (Ping timeout: 268 seconds) |
18:37
🔗
|
|
Aoede has joined #archiveteam-bs |
18:38
🔗
|
|
fie has joined #archiveteam-bs |
18:43
🔗
|
|
purplebot has joined #archiveteam-bs |
18:43
🔗
|
|
Rai-chan has joined #archiveteam-bs |
19:01
🔗
|
|
SHODAN_UI has quit IRC (Remote host closed the connection) |
19:06
🔗
|
|
xmc has quit IRC (Read error: Operation timed out) |
19:09
🔗
|
|
xmc has joined #archiveteam-bs |
19:09
🔗
|
|
swebb sets mode: +o xmc |
19:28
🔗
|
SketchCow |
FOS is now back to half-full, although you maniacs could probably fill it if you tried |
19:28
🔗
|
|
JRWR has joined #archiveteam-bs |
19:33
🔗
|
|
za3k has quit IRC (Quit: http://chat.efnet.org (EOF)) |
20:03
🔗
|
* |
zino whistles innocently. |
20:37
🔗
|
|
gui7 has joined #archiveteam-bs |
20:37
🔗
|
|
gui7 has left LIST |
20:38
🔗
|
|
gui7 has joined #archiveteam-bs |
20:39
🔗
|
|
gui7 has quit IRC (Remote host closed the connection) |
20:39
🔗
|
|
gui7 has joined #archiveteam-bs |
20:40
🔗
|
|
SHODAN_UI has joined #archiveteam-bs |
21:48
🔗
|
|
icedice has quit IRC (Quit: Leaving) |
21:49
🔗
|
|
gui7 has quit IRC (Leaving.) |
21:53
🔗
|
deathy |
SketchCow: maybe update http://www.archiveteam.org/index.php?title=Rescuing_optical_media in case you know of better tools now? I'm also working through a backlog of personal CD/DVDs now... |
22:42
🔗
|
|
dashcloud has joined #archiveteam-bs |
23:01
🔗
|
|
yakfish has quit IRC (Operation timed out) |
23:06
🔗
|
|
SHODAN_UI has quit IRC (Remote host closed the connection) |
23:10
🔗
|
wp494 |
deathy: anyone can |
23:20
🔗
|
|
twigfoot has joined #archiveteam-bs |
23:37
🔗
|
Odd0002 |
I used readom for my ISOs and they all seem to work fine in a windows 98SE VM |
23:39
🔗
|
|
ndiddy has joined #archiveteam-bs |
23:42
🔗
|
|
GLaDOS has joined #archiveteam-bs |