[00:05] When I had a monthly transfer limit my ISP would inject a warning into HTML pages when I was close to the limit
[00:06] Which is "evil" because they're modifying stuff in transit, but I guess it's not malicious. Just something I'd really rather they did not do
[00:06] It's moot now since I don't have a transfer cap anymore
[00:07] Yeah, that's one that was mentioned in the earlier discussion in #warrior.
[00:07] I have a considerably lower opinion of the practice of redirecting NXDOMAINs to ad serving pages
[00:08] That can fuck right off
[00:13] although for the HTML injection, it does bother me in that it implies that they have created infrastructure to parse and modify data in transit
[00:14] Yeah. Last time they did that, I saved a copy of the page, it's sitting around here somewhere.
[00:15] I was thinking it'd be interesting to pull some signatures out of that and have some sort of proactive detection of such hijinks, instead of relying on users' caution.
[00:16] Yes, in fact, I asked for exactly that in #warrior before so we can add this.
[00:16] I think I've seen some warrior scripts that have checks of some kind for tampering
[00:16] I don't remember anything that matters about it though
[00:16] Yes, but it's only very basic checks.
[00:16] Covers WiFi portals for example.
[00:17] If better checks could be implemented in the Warrior itself that all projects can benefit from, that'd be neat
[00:17] It simply queries the IPs for a few well-known hosts (Twitter, Facebook, Google, and a few others) and verifies that they're all different.
[00:17] ah
[00:17] yeah, I didn't remember what it actually did
[00:17] Well, it shouldn't be in the warrior, but yes.
[00:19] *** jamiew has joined #archiveteam-ot
[00:21] The Warrior couldn't fetch some URLs and check for tampering?
[00:21] The warrior is only the VM.
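A minimal sketch of the DNS check described at [00:17], assuming it amounts to "no IP address may be returned for more than one hostname". This is not the actual pipeline-script code; the function names and the host list in the usage note are illustrative assumptions.

```python
import socket
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def resolve_hosts(hosts: Iterable[str]) -> Dict[str, Set[str]]:
    """Resolve each hostname to its set of IPv4 addresses (needs network access)."""
    resolved = {}
    for host in hosts:
        # gethostbyname_ex returns (canonical_name, aliases, ip_addresses).
        _, _, addresses = socket.gethostbyname_ex(host)
        resolved[host] = set(addresses)
    return resolved


def shared_addresses(resolved: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return every IP that was handed out for more than one hostname."""
    owners = defaultdict(list)
    for host, addresses in resolved.items():
        for address in addresses:
            owners[address].append(host)
    return {ip: sorted(names) for ip, names in owners.items() if len(names) > 1}


def looks_hijacked(resolved: Dict[str, Set[str]]) -> bool:
    """True if unrelated hosts resolve to a common IP (captive portal, DNS redirect)."""
    return bool(shared_addresses(resolved))
```

On a live connection this could be called as `looks_hijacked(resolve_hosts(["twitter.com", "facebook.com", "google.com"]))`. As the log notes, a check like this catches WiFi portals but not content injected into individual pages.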
[00:21] Well, the once-in-a-blue-moon interception wouldn't happen very often, you'd need to grep for the tamper signatures in each warc, I think
[00:22] Those checks you mentioned are currently in the pipeline scripts, but it might be a good idea to move them to seesaw.
[00:22] ah, right
[00:22] I meant the thing that runs in the warrior that orchestrates running jbos
[00:22] jobs*
[00:22] Yep, that's seesaw.
[00:22] like if you get the bandwidth warning or something, you can't trigger that by just requesting a page
[00:22] Hmm, it's not injected into every page?
[00:23] no, when I used to get the maintenance reminder, it'd come up in like one or two pages and the rest of my session was unmolested
[00:23] Depends on how the ISP implements it
[00:23] Right
[00:24] Ok, better solution: always retrieve everything over HTTPS, and if something requires HTTP, only run that on trusted machines/connections.
[00:24] Trusted how?
[00:24] then you need a way to track trustedness, but yeah
[00:24] Well, connected over providers that aren't doing this shit.
[00:25] If a datacentre started doing this, they'd go bankrupt in days.
[00:25] So that's a start.
[00:25] Basically, no consumer ISPs.
[00:26] Although there are of course ones that actually do their job properly.
[00:27] My ISP would be one of those that does this (unless they don't do it anymore), but they don't do it to *me* since I don't have a transfer cap
[00:27] yeah, I avoided running a warrior for years because I wasn't confident that I'd completely opted out of that stuff -- there's no confirmation -- but it's been years since I've seen it so I think I'm okay.
[00:27] And yeah, I also haven't seen it in years
[00:28] Depending on what you do on the internet, you might browse 99+ % over HTTPS anyway.
[00:30] Looking at my history, I definitely regularly browse some non-HTTPS pages
[00:34] There are other solutions to this, namely we could have a system that retrieves everything multiple times from different locations, then verifies that they're the same and punishes those that provide mismatching responses.
[00:34] That's how BOINC works
[00:34] But we normally don't have time to do things multiple times here.
[00:35] Yeah, that YouTube annotation archival project did it as well I think.
[00:35] (That wasn't AT.)
[00:36] That would require some nuance for checking if things are the "same" in some cases - random tracking IDs or element IDs (to prevent adblocking), apparently random order of some page elements (Youtube in some cases), etc.
[00:37] this is getting real complicated :p
[00:37] Yep, that as well.
[00:43] If it could be made to work, the insecure-to-datacenter approach would probably cover most cases; the only thing you'd have to worry about (I think) would be outright disconnection as opposed to injection
[00:44] Is that a thing?
[00:44] Depends if you could get enough people running the scripts in a datacentre
[00:45] That hasn't been a problem lately.
[00:45] Just ping yano or Fusl, and they'll throw a couple thousand workers at it. lol
[00:45] No, not really. A server provider won't ban you for scraping a website unless they get abuse reports
[00:45] never happened to me
[00:45] I don't think we've had capacity problems on OUR side of things for a while o_o
[00:45] ah, well that's good then
[00:45] Yeah, the target server might ban the IP or similar, but that's not unique to DC connections.
[00:46] Well... We haven't had capacity issues on the worker side of things.
[00:46] Targets are a different matter.
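A rough sketch of the fetch-from-several-locations idea raised at [00:34] to [00:36]: normalize away the parts of a page that legitimately vary, then flag any location whose copy disagrees with the majority. The patterns in `VOLATILE_PATTERNS` and the location labels in the usage note are made-up assumptions, and randomized element order is not handled here.

```python
import hashlib
import re
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical examples of per-request noise; a real project would need
# patterns tuned to the target site.
VOLATILE_PATTERNS = [
    re.compile(rb'id="[^"]*"'),                          # randomized element IDs
    re.compile(rb"[?&]track(?:ing)?_?id=[0-9A-Za-z]+"),  # tracking parameters
]


def normalize(body: bytes) -> bytes:
    """Strip the volatile parts so that equivalent pages compare equal."""
    for pattern in VOLATILE_PATTERNS:
        body = pattern.sub(b"", body)
    return body


def fingerprint(body: bytes) -> str:
    """SHA-256 of the normalized body."""
    return hashlib.sha256(normalize(body)).hexdigest()


def find_outliers(responses: Dict[str, bytes]) -> Tuple[str, List[str]]:
    """Return (majority fingerprint, locations whose response differs from it)."""
    if not responses:
        raise ValueError("need at least one response")
    prints = {location: fingerprint(body) for location, body in responses.items()}
    # On a tie, Counter.most_common keeps the fingerprint seen first.
    majority, _ = Counter(prints.values()).most_common(1)[0]
    outliers = sorted(loc for loc, value in prints.items() if value != majority)
    return majority, outliers
```

For instance, with two datacentre copies that differ only in an element ID and a home copy carrying an injected `<div>`, `find_outliers` would list only the home location. The cost is the one pointed out at [00:34]: every URL has to be fetched several times.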
[00:46] I have a setup now that lets me spin up hundreds of workers too :D
[00:46] wibblywobbly targetywargety things
[00:46] that's what I thought pnJay meant in the first place
[00:47] Is there some way I can keep abreast of which projects need bunches of warriors without checking the wiki or something every day
[00:47] Project launches are normally announced in #archiveteam, but other than that, you'd have to keep an eye on the project channel.
[00:48] jasons twitter, i guess
[00:48] Not really, no.
[00:48] pnJay: that's how I usually find out about these things
[00:48] JAA: well given how low traffic that channel is that kinda works for me
[00:49] Jason's tweets are fine to find out that something needs to be archived, but not for "does this project need workers?" since he's typically not at all involved in the actual project.
[00:49] last I checked I didn't get billed for all that overage bandwidth I used on Vultr either <_<
[00:49] but I'll find out in a week and a half
[00:50] I can also just leave more workers running on my home server if that help
[00:50] right now they're mostly just idle or URLteam
[00:50] *helps
[00:50] I have one warrior (6 threads) up atm
[00:51] JAA: better than nothing. I can jump in the channel and gauge from there
[01:32] *** DogsRNice has quit IRC (Ping timeout: 276 seconds)
[01:32] *** DogsRNice has joined #archiveteam-ot
[02:09] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[02:40] *** SoraUta has joined #archiveteam-ot
[02:45] IPv6 on digitalocean: "We support a maximum of 16 addresses (a subnet mask of /124 ) per Droplet. Additional addresses are not available. "
[02:48] Hey, still better than OVH, where you get a single address.
[02:48] (On VPS)
[02:48] *** icedice has quit IRC (Quit: Leaving)
[02:49] wow
[03:05] *** cerca has quit IRC (Remote host closed the connection)
[03:43] *** DogsRNice has quit IRC (Read error: Connection reset by peer)
[03:50] *** ShellyRol has quit IRC (Read error: Connection reset by peer)
[03:51] *** ShellyRol has joined #archiveteam-ot
[03:51] *** LowLevelM has quit IRC (Quit: Ping timeout (120 seconds))
[03:51] *** LowLevelM has joined #archiveteam-ot
[04:36] Or any provider at colocrossing where you get 0 ipv6
[04:47] *** qw3rty2 has joined #archiveteam-ot
[04:56] *** qw3rty has quit IRC (Ping timeout: 745 seconds)
[05:45] *** fuzzy802 has joined #archiveteam-ot
[05:46] *** superkuh has quit IRC (Excess Flood)
[05:46] *** nyany has quit IRC (Read error: Operation timed out)
[05:47] *** voltagex has quit IRC (Ping timeout: 262 seconds)
[05:48] *** fuzzy8021 has quit IRC (Read error: Operation timed out)
[05:48] *** superkuh has joined #archiveteam-ot
[05:49] *** VADemon_ has joined #archiveteam-ot
[05:49] *** arkiver has quit IRC (Ping timeout: 360 seconds)
[05:49] *** swebb has quit IRC (Read error: Operation timed out)
[05:49] *** swebb has joined #archiveteam-ot
[05:50] *** fuzzy802 has quit IRC ()
[05:50] *** fuzzy8021 has joined #archiveteam-ot
[05:51] *** arkiver has joined #archiveteam-ot
[05:51] *** svchfoo3 sets mode: +o arkiver
[05:51] *** svchfoo1 sets mode: +o arkiver
[05:51] *** ShellyRol has quit IRC (Read error: Operation timed out)
[05:52] *** chfoo has quit IRC (Ping timeout: 360 seconds)
[05:53] *** ShellyRol has joined #archiveteam-ot
[05:54] *** Igloo has quit IRC (Read error: Connection reset by peer)
[05:55] *** nyany has joined #archiveteam-ot
[05:55] *** voltagex has joined #archiveteam-ot
[05:56] *** Igloo has joined #archiveteam-ot
[05:56] *** chfoo has joined #archiveteam-ot
[05:57] *** svchfoo3 sets mode: +o chfoo
[05:57] *** svchfoo1 sets mode: +o chfoo
[05:59] *** VADemon has quit IRC (Read error: Operation timed out)
[06:01] *** jamiew has quit IRC (Textual IRC Client: www.textualapp.com)
[06:12] *** jamiew has joined #archiveteam-ot
[06:16] *** jamiew has quit IRC (Read error: Operation timed out)
[06:17] *** bluefoo has joined #archiveteam-ot
[06:17] *** jamiew has joined #archiveteam-ot
[07:43] *** ShellyRol has quit IRC (Ping timeout: 610 seconds)
[07:54] *** ShellyRol has joined #archiveteam-ot
[08:27] *** schbirid has joined #archiveteam-ot
[08:33] *** LowLevelM has quit IRC (Read error: Operation timed out)
[08:36] *** wp494 has quit IRC (LOUD UNNECESSARY QUIT MESSAGES)
[08:38] *** LowLevelM has joined #archiveteam-ot
[08:44] *** wp494 has joined #archiveteam-ot
[08:50] *** killsushi has quit IRC (Leaving)
[08:56] *** BlueMaxim has joined #archiveteam-ot
[09:09] *** BlueMax has quit IRC (Ping timeout: 745 seconds)
[09:44] *** Atom-- has joined #archiveteam-ot
[09:50] *** Atom__ has quit IRC (Read error: Operation timed out)
[10:06] *** dhyan_nat has joined #archiveteam-ot
[11:15] *** BlueMax has joined #archiveteam-ot
[11:24] *** BlueMax has quit IRC (Ping timeout: 276 seconds)
[11:25] *** BlueMax has joined #archiveteam-ot
[11:26] *** BlueMaxim has quit IRC (Ping timeout: 745 seconds)
[11:27] *** DigiDigi has quit IRC (Remote host closed the connection)
[11:32] *** cerca has joined #archiveteam-ot
[11:55] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[11:55] *** BlueMax has joined #archiveteam-ot
[12:40] *** ola_norsk has joined #archiveteam-ot
[12:41] Archiving is futile https://www.youtube.com/watch?v=uD4izuDMUQA
[12:42] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[12:42] When archiving a driver for a USB oscilloscope, i happened to notice one of the vendor's software/driver installer is said by WinXP to be corrupted. The file in question https://www.linkinstruments.com/mso19_3_setup_web_32bit.exe
[12:43] is there any way to verify that it's not simply something awry with my virtualbox?
[12:43] i've tried downloading the file several times, and unfortunately i don't have a "real" windows box to try it on
[12:44] item in question: https://archive.org/details/Link_Instruments_MSO-19_usb_oscilloscope_software_2019
[12:44] it's such a dated file, i would think the manufacturer should have noticed and fixed it by now if it was indeed a corrupted file
[12:46] tried wine?
[12:50] schbirid: hmmm..weird. Through wine the same installer actually launches
[12:50] i'm guessing it's just an XP problem then
[12:53] Works for me in XP SP3 in QEMU
[12:54] OrIdow6: here's the message i get when trying to run it https://i.imgur.com/WhqgdAo.png
[12:55] but i take it the file is fine then, and the shit is with my vm
[12:55] ola_norsk: download HashCheck Shell Extension so we can compare the file hash
[12:56] Raccoon: couldn't md5sum or shasum do that as well?
[12:56] sure
[12:57] File: mso19_3_setup_web_32bit.exe CRC-32: c4f17baf MD5: b4332bf1a3f361ba1ad7b176d39b6be4 SHA-1: c7434ba7468d058d978051717853c67509e496c7
[12:57] md5sum say b4332bf1a3f361ba1ad7b176d39b6be4
[12:58] Likewise
[12:58] as long as that's the file on your xp machine, then it's not corruption
[12:58] something wonky with the winxp then, whatever it could be
[12:59] did you perform the hash check within xp
[12:59] no i just md5sum'ed it from the same virtualbox shared folder
[12:59] just wondering if xp sees the file correctly, too
[13:00] i'll try the HashCheck Shell Extension thingy now within the vm
[13:00] Perhaps also move it from the shared folder to the native VM drive
[13:10] odd, the same file through the WinXP, it states the md5 to be 55b4d3ee442371e2bc25e19eeeb5b762 ..
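A small sketch of the comparison being done by hand above: compute the same three digests that the [12:57] message lists (CRC-32, MD5, SHA-1) so the file can be checked wherever Python runs, inside or outside the VM. The helper names are assumptions for illustration; only standard-library calls are used.

```python
import hashlib
import zlib
from typing import Dict, Iterable, Iterator


def file_chunks(path: str, size: int = 1 << 20) -> Iterator[bytes]:
    """Yield a file's contents in fixed-size chunks so large files fit in memory."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(size)
            if not chunk:
                return
            yield chunk


def digests(chunks: Iterable[bytes]) -> Dict[str, str]:
    """Compute CRC-32, MD5 and SHA-1 over the chunks in a single pass."""
    crc = 0
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)  # running CRC, seeded with the previous value
        md5.update(chunk)
        sha1.update(chunk)
    return {
        "crc32": format(crc & 0xFFFFFFFF, "08x"),
        "md5": md5.hexdigest(),
        "sha1": sha1.hexdigest(),
    }
```

Running `digests(file_chunks("mso19_3_setup_web_32bit.exe"))` on each side and comparing the `md5` entry against b4332bf1a3f361ba1ad7b176d39b6be4 would show whether the copy the guest sees matches the download. A mismatch only inside the VM points at the shared folder or the guest rather than at the vendor's file.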
[13:11] The shell plugin keeps b4332bf1a3f361ba1ad7b176d39b6be4 for me
[13:11] anywho, as long as Raccoon got the same as in https://ia601408.us.archive.org/18/items/Link_Instruments_MSO-19_usb_oscilloscope_software_2019/Link_Instruments_MSO-19_usb_oscilloscope_software_2019_files.xml , i'll go with the file being actually fine and that it's my vm that is shit
[13:14] *** ola_norsk has quit IRC (Quit: leaving)
[13:27] *** MrRadar has joined #archiveteam-ot
[13:28] JAA: that's my speciality :3
[13:28] i'm a little competitive
[13:30] *** MrRadar has quit IRC (Read error: Operation timed out)
[13:51] *** dhyan_nat has joined #archiveteam-ot
[14:00] *** oxguy3 has joined #archiveteam-ot
[14:01] hey anyone ever run into archive.org's web uploader rejecting a PDF file? getting 400 Bad Data with the error "Syntax error detected in pdf data. You may be able to repair the pdf file with a repair tool, pdftk is one such tool."
[14:02] the pdf opens fine on my PC -- i tried opening it in Acrobat Pro and resaving it, which made a file with a different md5sum but apparently still one that archive.org didn't like. i'm on mac so can't get pdftk
[14:03] is it better to use the first 256 bits of a sha512 or is that no better than just a sha256?
[14:08] In my experience, archive.org's web uploader sucks. It almost never works.
[14:09] i've been uploading a ton of pdfs without issue. this error i presume is on the backend (that error message actually comes from some raw XML that i had to hit "show details" to view)
[14:12] i worked around it by zipping the PDF. not ideal since it means no derivative files get generated, but better than no file at all. if anyone cares to take a crack at figuring out the issue... https://archive.org/details/seattlesoundersfc2017mediaguide
[14:32] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[14:41] oxguy3: You should be able to install pdftk via Homebrew https://brew.sh however, even after fixing the PDF with pdftk, I still get that same error on the fixed file
[14:42] oh good to know, thanks
[14:45] i also tried ilovepdf.com's repair tool, which curiously knocked about 1MB off the file size, but also failed to resolve the error. bizarre issue...
[14:47] yup, errors even via using the command-line tool
[15:17] *** SoraUta has quit IRC (Read error: Operation timed out)
[15:28] damn, it happened again -- another supposedly corrupt PDF that opens fine on my computer https://archive.org/details/dcunited2017mediaguide
[16:03] *** oxguy3 has quit IRC (My MacBook has gone to sleep. ZZZzzz…)
[17:19] *** MrRadar has joined #archiveteam-ot
[17:25] schbirid: "SHA-256 and SHA-512 are novel hash functions computed with 32-bit and 64-bit words, respectively. They use different shift amounts and additive constants, but their structures are otherwise virtually identical, differing only in the number of rounds."
[17:25] https://en.wikipedia.org/wiki/SHA-2
[17:26] not really sure what "rounds" means
[17:28] I think for the purposes of file verification, sha256 and sha512 truncated to 256 would be the same
[17:29] well, not the same outputs, but the same usefulness
[17:39] 256 bits of the SHA-512 hash is slightly safer than the SHA-256 hash because it prevents length extension attacks and some other things. Whether that matters depends on what you're trying to do, obviously.
[17:43] Also, on decently modern machines, SHA-512 is a bit faster than SHA-256 due to the 64-bit arithmetics.
[17:47] nice
[17:48] *** VerifiedJ has joined #archiveteam-ot
[17:48] *** wp494 has quit IRC (Ping timeout: 745 seconds)
[17:49] *** wp494 has joined #archiveteam-ot
[17:52] or you could use sha3-256 or sha3-256 via the command `rhash`
[17:52] or sha3-512
[17:52] for example: `rhash --sha3-512 myfile.txt`
[18:02] At least the OpenSSL implementation is a fair bit slower than SHA-2 though.
[18:02] oxguy3 , that pdf has issues. what's the original link?
[18:04] type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
[18:04] sha512          34398.78k   137441.77k   276030.81k   438686.72k   544983.72k   569671.68k
[18:04] sha3-512        25492.26k   100461.65k   125587.97k   146192.04k   168471.21k   167242.83k
[18:04] Ew
[18:07] *** Dallas has joined #archiveteam-ot
[18:21] *** DigiDigi has joined #archiveteam-ot
[18:36] *** schbirid has quit IRC (Quit: Leaving)
[18:53] Frogging: rounds has to do with the number of times the algorithm runs internally to produce a final result
[18:54] the fips standard for SHA-512 standardized two algorithms, SHA-512 and "SHA-512/t", the latter uses a different initialization vector (a lot less sketchy than the one for SHA-512 and SHA-256) and truncates the final output
[18:55] what JAA was describing before is SHA-512/256 (SHA-512/t with t := 256), which has become a popular algorithm with followers of daniel bernstein
[18:56] Actually, not exactly. SHA-512/256 is specifically the *first* 256 bits of the SHA-512 hash.
[18:56] if SHA-512/256 is not available to you (it frequently is not), you could consider using SHA-384 instead, which also truncates its output and doesn't lend itself to length extension attacks trivially either
[18:58] But you could really keep any 256 bit for an equivalent security level. (Not that there's any good reason to do so though.)
[19:00] ah, calling the 256 leftmost bits of sha-512 a "sha-512/256" hash is wildly misleading
[19:00] Well yeah, it's not the same due to the different IV.
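A short sketch of the distinction settled at [19:00]: taking the leftmost 256 bits of a SHA-512 digest is not the same function as SHA-512/256, because the latter starts from a different IV. The helper names are illustrative assumptions; `sha512_256` returns None when the local hashlib/OpenSSL build does not offer that algorithm, which the log says is common.

```python
import hashlib
from typing import Optional


def truncated_sha512(data: bytes, nbytes: int = 32) -> bytes:
    """Leftmost nbytes of the plain SHA-512 digest (standard SHA-512 IV)."""
    return hashlib.sha512(data).digest()[:nbytes]


def sha512_256(data: bytes) -> Optional[bytes]:
    """FIPS 180-4 SHA-512/256 if this build provides it, otherwise None."""
    if "sha512_256" not in hashlib.algorithms_available:
        return None
    try:
        return hashlib.new("sha512_256", data).digest()
    except ValueError:
        # Listed by OpenSSL but refused at construction time (e.g. by policy).
        return None


if __name__ == "__main__":
    sample = b"abc"
    print("sha256            ", hashlib.sha256(sample).hexdigest())
    print("sha512, first 256 ", truncated_sha512(sample).hex())
    standard = sha512_256(sample)
    print("sha512/256        ", standard.hex() if standard else "not available")
```

All three are 256-bit values, but none of them are interchangeable when verifying a file. Whichever one is used has to be recorded by name alongside the hash.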
[19:02] :-(
[19:02] fucking cryptographers
[20:07] *** Dj-Wawa has quit IRC (Dj-Wawa)
[20:08] *** Dj-Wawa has joined #archiveteam-ot
[20:09] *** oxguy3 has joined #archiveteam-ot
[20:21] does anyone know of any tools that can download the original PDFs from issuu/scribd without a subscription? there's a couple of free sites out there that will rip PDFs, but they're actually just generating new PDFs from JPGs which means you have to OCR them
[20:29] *** X-Scale` has joined #archiveteam-ot
[20:34] *** X-Scale has quit IRC (Ping timeout: 610 seconds)
[20:34] *** X-Scale` is now known as X-Scale
[20:44] *** SoraUta has joined #archiveteam-ot
[20:44] *** tuluu has quit IRC (Ping timeout: 276 seconds)
[20:49] *** oxguy3 has quit IRC (Ping timeout: 246 seconds)
[21:20] *** VerifiedJ has quit IRC (Quit: Leaving)
[21:25] *** oxguy3 has joined #archiveteam-ot
[21:28] *** benjins has quit IRC (Read error: Connection reset by peer)
[21:30] *** benjins has joined #archiveteam-ot
[21:31] *** benjins has quit IRC (Remote host closed the connection)
[21:33] *** benjins has joined #archiveteam-ot
[22:48] *** oxguy3 has quit IRC (My MacBook has gone to sleep. ZZZzzz…)
[22:48] *** oxguy3 has joined #archiveteam-ot
[22:49] *** oxguy3 has quit IRC (Client Quit)
[22:49] *** oxguy3 has joined #archiveteam-ot
[22:49] *** oxguy3 has quit IRC (Client Quit)
[22:50] *** oxguy3 has joined #archiveteam-ot
[22:50] *** oxguy3 has quit IRC (Client Quit)
[22:51] *** oxguy3 has joined #archiveteam-ot
[22:51] *** oxguy3 has quit IRC (Client Quit)
[22:52] *** oxguy3 has joined #archiveteam-ot
[22:52] *** oxguy3 has quit IRC (Client Quit)
[22:53] *** oxguy3 has joined #archiveteam-ot
[22:53] *** oxguy3 has quit IRC (Client Quit)
[22:54] *** oxguy3 has joined #archiveteam-ot
[22:54] *** oxguy3 has quit IRC (Client Quit)
[22:55] *** oxguy3 has joined #archiveteam-ot
[22:55] *** oxguy3 has quit IRC (Client Quit)
[23:12] *** BlueMax has joined #archiveteam-ot
[23:33] *** dhyan_nat has quit IRC (Read error: Operation timed out)