[00:03] *** DFJustin has quit IRC (Read error: Connection reset by peer)
[00:05] *** TC01 has quit IRC (Quit: No Ping reply in 180 seconds.)
[00:05] *** t2t2 has quit IRC (Quit: No Ping reply in 180 seconds.)
[00:06] *** eprillios has quit IRC (Quit: No Ping reply in 180 seconds.)
[00:06] *** DFJustin has joined #archiveteam-bs
[00:06] *** swebb sets mode: +o DFJustin
[00:07] *** TC01 has joined #archiveteam-bs
[00:07] *** eprillios has joined #archiveteam-bs
[00:08] *** pikhq has quit IRC (Ping timeout: 260 seconds)
[00:08] *** t2t2 has joined #archiveteam-bs
[00:08] *** Sue_ has quit IRC (Ping timeout: 260 seconds)
[00:13] *** pikhq has joined #archiveteam-bs
[00:28] *** pnJay has joined #archiveteam-bs
[00:45] *** j08nY has quit IRC (Quit: Leaving)
[01:06] *** Sue_ has joined #archiveteam-bs
[01:07] *** Stilett0 has quit IRC (Read error: Operation timed out)
[01:09] *** ndiddy has quit IRC ()
[01:14] *** namespace has joined #archiveteam-bs
[01:23] *** odemg has quit IRC (Remote host closed the connection)
[01:26] *** phuzion has quit IRC (Ping timeout: 600 seconds)
[01:26] *** phuzion has joined #archiveteam-bs
[01:30] *** pnJay has quit IRC (Leaving)
[01:39] *** pizzaiolo has left
[01:48] *** odemg has joined #archiveteam-bs
[01:51] *** Stilett0 has joined #archiveteam-bs
[01:57] *** kyounko has quit IRC (Read error: Operation timed out)
[02:18] *** RichardG has quit IRC (Read error: Operation timed out)
[02:18] *** RichardG has joined #archiveteam-bs
[02:28] *** ItsYoda has quit IRC (Quit: rippppp to the yoda you used to know!)
[02:29] *** zhongfu_ has quit IRC (Remote host closed the connection)
[02:29] *** ItsYoda has joined #archiveteam-bs
[02:30] *** Meroje has quit IRC (Ping timeout: 260 seconds)
[02:30] *** wm_ has quit IRC (Ping timeout: 260 seconds)
[02:30] *** Zebranky has quit IRC (Ping timeout: 260 seconds)
[02:31] *** Meroje has joined #archiveteam-bs
[02:31] *** Zebranky has joined #archiveteam-bs
[02:32] *** zhongfu has joined #archiveteam-bs
[02:32] *** wm_ has joined #archiveteam-bs
[02:41] *** bwn has quit IRC (Ping timeout: 244 seconds)
[02:47] *** bwn has joined #archiveteam-bs
[03:02] *** bwn has quit IRC (Read error: Operation timed out)
[03:13] *** RichardG has quit IRC (Read error: Operation timed out)
[03:13] *** RichardG has joined #archiveteam-bs
[03:40] *** bwn has joined #archiveteam-bs
[03:40] *** RichardG has quit IRC (Read error: Operation timed out)
[03:40] *** RichardG has joined #archiveteam-bs
[04:07] *** RichardG has quit IRC (Read error: Operation timed out)
[04:07] *** RichardG has joined #archiveteam-bs
[04:32] *** RichardG has quit IRC (Read error: Operation timed out)
[04:32] *** RichardG has joined #archiveteam-bs
[04:59] *** RichardG has quit IRC (Read error: Operation timed out)
[04:59] *** RichardG has joined #archiveteam-bs
[05:26] *** RichardG has quit IRC (Read error: Operation timed out)
[05:26] *** RichardG has joined #archiveteam-bs
[05:51] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[05:58] *** Sk1d has joined #archiveteam-bs
[05:59] HCross2: here now
[06:01] Somebody2: Hi, looks like it may be worth starting again and totally getting all identifiers again
[06:01] I ended up with 180 million identifiers somehow last night
[06:02] Hm. 180 million *identifiers*? That doesn't seem right.
[06:03] Yea..
i ended up running bashreduce and getting 18 million or so
[06:04] considering that the first census found 14 million identifiers
[06:04] However https://archive.org/~tracey/mrtg/alln.html is about 23.4 million and I'd like to get the lot
[06:05] and the 2nd had 19 million -- it seems ... unlikely that we would have added over 170 million since then.
[06:05] We have an idea of how many items there are, we just need to build a list of names
[06:06] wait, what's wrong with using the list of names from https://archive.org/download/archiveteam_census_2017/2017.03-ia_identifiers.txt.gz ?
[06:06] That's only uploads in March
[06:06] no, that's all the identifiers *as of* March
[06:06] as I understand
[06:06] Ohhhh I've been an idiot if it is
[06:07] Explains where the 180 million came from
[06:07] lololol
[06:07] yes, yes it would
[06:08] So.. all of my faffing around yesterday with bashreduce was because I didn't check the file
[06:08] eek!
[06:08] That's ... a rather good sign we should improve the documentation. :-)
[06:08] bwn: ping?
[06:10] hm Somebody2 - that's only 535723 identifiers
[06:10] hey hi
[06:11] Hm. That actually might be correct -- there are a LOT of identifiers that aren't in the search engine.
[06:11] over 22 million items not in the search?
[06:13] hmm
[06:13] we know our overall target is 23.4 million, it's just how we get there
[06:13] Agreed.
[06:13] Could we not just tell IA search to "list every identifier it can find"
[06:14] via the python lib
[06:14] That's what we do. Specifically, `ia search all:1 --itemlist`
[06:14] You can see the source here: https://gitlab.com/bwn/cron-census-identifiers/blob/master/extract-identifiers.sh
[06:14] That uses the python lib, just via a command-line wrapper.
[06:15] bwn is planning to rewrite the script directly in python, to avoid path weirdness with cron.
[06:15] ah I see, and that should in theory get everything there is
[06:15] wait, zcat 2017.03-ia_identifiers.txt.gz | wc -l gives me 23237281
[06:15] ohh it's compressed, that's where I was going wrong
[06:16] (its 6am lol)
[06:16] * Somebody2 waves at HCross2's tired brain
[06:16] :)
[06:18] I get the same count from 2017.03
[06:19] ah thanks
[06:19] 23,237,281 identifiers -- which should be plenty for the main body of the census.
[06:20] Then we can do an appendix consisting of the identifiers present in any of the other identifier lists, but missing from that one. Most of them will be dark or otherwise inaccessible.
[06:25] OK, I'm heading to sleep. G'night (or morning) all!
[06:29] goodnight.
[06:29] Census 2017 is now happening :)
[06:44] *** Matt_Lock has joined #archiveteam-bs
[06:47] Question: I'm running app.net in the warrior, and every download attempt it shows either the result "Server returned 0 (CONIMPOSSIBLE). Sleeping." or "Server returned 0 (CONERROR). Sleeping." What's going on?
[06:51] "It's dead, Jim"
[06:52] At least that's my guess
[06:52] So, switch to running urlteam or whatever?
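A minimal sketch of the identifier-gathering step discussed above (around 06:14), assuming the `ia` command-line tool from the internetarchive Python package is installed and configured; the output filename here is arbitrary:

  # Pull every identifier the IA search engine will return -- the same command
  # the census cron script wraps:
  ia search all:1 --itemlist | gzip > ia_identifiers.txt.gz

  # Sanity-check the count against the ~23.4 million items reported by
  # https://archive.org/~tracey/mrtg/alln.html (remember to decompress first,
  # as noted above):
  zcat ia_identifiers.txt.gz | wc -l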
[07:11] *** RichardG has quit IRC (Read error: Operation timed out)
[07:11] *** RichardG has joined #archiveteam-bs
[07:26] *** JAA has quit IRC (Quit: Page closed)
[07:37] *** RichardG has quit IRC (Read error: Operation timed out)
[07:38] *** RichardG has joined #archiveteam-bs
[07:40] *** j08nY has joined #archiveteam-bs
[08:07] *** RichardG has quit IRC (Read error: Operation timed out)
[08:07] *** RichardG has joined #archiveteam-bs
[08:08] *** JAA has joined #archiveteam-bs
[08:19] *** masterX24 has joined #archiveteam-bs
[08:32] *** j08nY has quit IRC (Read error: Operation timed out)
[08:43] *** schbirid has joined #archiveteam-bs
[09:10] *** Coderjo has quit IRC (Read error: Operation timed out)
[09:10] *** Coderjo has joined #archiveteam-bs
[09:29] *** j08nY has joined #archiveteam-bs
[09:38] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[09:41] *** GE has joined #archiveteam-bs
[09:49] *** Jonison has joined #archiveteam-bs
[10:03] Matt_Lock: Switch to Yahoo. It's a massive project, and needs more people on it.
[10:04] shame you can't run it with more than 4 concurrent
[10:05] Aye.
[10:14] *** j08nY has quit IRC (Quit: Leaving)
[10:26] *** RichardG has quit IRC (Read error: Operation timed out)
[10:26] *** RichardG has joined #archiveteam-bs
[10:53] *** RichardG has quit IRC (Read error: Operation timed out)
[10:53] *** RichardG has joined #archiveteam-bs
[10:54] Does anyone in here know of something like a checklist for use while uploading IA software items, with information like what metadata keys/tags to set and files to include, to minimize any hassle SketchCow and other collection administrators have to go through to get stuff in the right place?
[10:55] I've been looking, but I'm not getting much further than some information in the IA FAQ and some documentation on the uploader and derivative formats
[10:59] *** JAA has quit IRC (Quit: Page closed)
[11:03] I'm currently writing some stuff down like "make sure to dump in a raw imaging format like BIN/CUE or IMG" and "Provide scans of the disk label and if possible the jewel case covers/booklets/inlays and boxes"
[11:04] but I'm, for example, having a hard time specifying what quality the scans should be, other than saying "it should be OCR-able, try at least 300 dpi"
[11:07] I think I'll just create a subpage under my personal page on the AT wiki and start tinkering
[11:12] *** pizzaiolo has joined #archiveteam-bs
[11:25] I'm trying to begin work on a book with some volunteer writers to discuss how it's done
[11:25] Aha, nice!
[11:26] But I can certainly use someone like you on the outside tinkering with a wiki
[11:26] Although I know most of it.
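As a hedged illustration of where the checklist drafted above is heading, this is roughly what such a software-item upload looks like with the `ia` CLI; the identifier, file names, and metadata values below are invented for the example, and the target collection is left out since placing items is up to the collection admins:

  ia upload ExampleGame_1995_CDROM \
    examplegame.bin examplegame.cue \
    scans/disc-label.png scans/jewelcase-front.png scans/jewelcase-back.png \
    --metadata="mediatype:software" \
    --metadata="title:Example Game (1995, CD-ROM)" \
    --metadata="date:1995" \
    --metadata="subject:CD-ROM images" \
    --metadata="description:BIN/CUE dump plus 300 dpi scans of the disc label and jewel case inserts."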
[11:26] I did speak about it too
[11:26] https://www.youtube.com/watch?v=rW7w7ZphM3Y
[11:27] I tried to go from "why is this even" all the way through
[11:27] I've watched some of your talks but I hadn't seen this one yet, I'll check it out
[11:28] *** odemg has quit IRC (Remote host closed the connection)
[11:29] I'll tinker on the wiki as soon as I actually remember what password I used for the wiki, FML
[11:31] Now it's throwing 508s at me, great
[11:32] *** GE has quit IRC (Remote host closed the connection)
[11:51] *** Jonison has quit IRC (Read error: Connection reset by peer)
[11:57] *** JAA has joined #archiveteam-bs
[11:59] *** JAA has quit IRC (Client Quit)
[12:01] *** biziclop has joined #archiveteam-bs
[12:01] *** biziclop has left
[12:05] also @SketchCow: since you said 2 days ago to hold off a bit on the DMOZ crawl, and I asked how long to hold off, you didn't respond/answer that at all
[12:10] *** odemg has joined #archiveteam-bs
[12:14] I did but you were gone
[12:14] I think you ended up coming up with a way to crawl the missing IDs and I encouraged it
[12:22] *** pnJay has joined #archiveteam-bs
[12:33] *** RichardG has quit IRC (Read error: Operation timed out)
[12:33] *** RichardG has joined #archiveteam-bs
[12:35] @SketchCow: the crawl got everything as of the 14th
[12:36] and I was referring to these lines: Mar_15_2017_08:53:48 SketchCow Whut Mar_15_2017_08:53:55 SketchCow I'd say hold off for a bit, please Mar_15_2017_08:54:07 SketchCow (Upload of crawl)
[12:39] the result of the crawl was 3.5 million URLs, without the flag, suggest, apply and abuse links
[12:52] Somebody2: ping when you see this
[13:10] *** VADemon has joined #archiveteam-bs
[13:15] *** GE has joined #archiveteam-bs
[13:18] *** tfgbd_znc has quit IRC (Read error: Connection reset by peer)
[13:20] *** tfgbd_znc has joined #archiveteam-bs
[13:36] *** odemg has quit IRC (Remote host closed the connection)
[13:45] *** odemg has joined #archiveteam-bs
[14:12] *** RichardG has quit IRC (Read error: Operation timed out)
[14:12] *** RichardG has joined #archiveteam-bs
[14:39] *** RichardG has quit IRC (Read error: Operation timed out)
[14:39] *** RichardG has joined #archiveteam-bs
[15:04] *** RichardG has quit IRC (Read error: Operation timed out)
[15:04] *** RichardG has joined #archiveteam-bs
[15:24] HCross2: ping!
[15:34] *** LastNinja has quit IRC (Quit: byeeee)
[15:59] *** odemg has quit IRC (Remote host closed the connection)
[16:16] *** odemg has joined #archiveteam-bs
[16:20] *** TC01 has quit IRC (Read error: Operation timed out)
[16:24] *** TC01 has joined #archiveteam-bs
[16:39] *** odemg has quit IRC (Remote host closed the connection)
[16:45] *** LastNinja has joined #archiveteam-bs
[16:47] *** JAA has joined #archiveteam-bs
[16:55] *** masterX24 has quit IRC (Ping timeout: 268 seconds)
[17:05] *** odemg has joined #archiveteam-bs
[17:31] anyone know what the minimum requirements are for running the docker version of the warrior?
thinking of getting a raspberry pi to run the warrior that i can leave at my mum's house
[17:31] i'd use an old laptop but i feel that'd be a little over the top for what it'd be running
[17:32] *** phuzion has quit IRC (Read error: Connection reset by peer)
[17:32] *** phuzion has joined #archiveteam-bs
[17:32] *** phuzion has quit IRC (Remote host closed the connection)
[17:33] *** phuzion has joined #archiveteam-bs
[17:33] *** phuzion has quit IRC (Remote host closed the connection)
[17:34] SpaffGarg: well, first the image would have to be ARM-compatible
[17:38] *** phuzion has joined #archiveteam-bs
[17:45] Now I'm curious: is there any particular reason why that can't be done?
[17:46] git makes me think it has been done
[17:46] https://github.com/ArchiveTeam/warrior-dockerfile
[17:46] It looks like the Pi supports Docker so no reason you couldn't: https://www.raspberrypi.org/blog/docker-comes-to-raspberry-pi/
[17:46] Assuming you've got an ARM-compatible Docker image
[17:46] i run the scripts directly on a pi
[17:47] without warrior
[17:47] I personally use an old laptop for AT stuff since it's WAY faster than my RPi 1. No clue how the 2 or 3 models stack up though
[17:48] yeah i was going to do that as well but have warrior running the "choice" project for if something comes up overnight
[17:49] i don't see any resource issues with the pi. doesn't seem remotely taxed by its role
[17:50] but if you are going to leave it unattended, warrior seems like the way to go
[17:50] *** icedice has joined #archiveteam-bs
[17:50] yeah, i'll be 40 miles away from it apart from visiting
[17:51] might just get one and see, if it fails i'll turn it into a pineapple
[17:53] That warrior-dockerfile repository contains a binary wget-lua.raspberry, but no information on how that's built. Does anyone know if there's anything special about it? Seems strange to me to include the binary instead of a build script.
[17:57] *** schbirid has quit IRC (Quit: Leaving)
[18:00] On a related note, I find it somewhat concerning that the get-wget-lua.sh script (in all the individual *-grab repositories) retrieves the wget-lua code over HTTP and doesn't even bother to use a checksum to verify that the file is fine.
[18:07] *** odemg has quit IRC (Remote host closed the connection)
[18:08] *** odemg has joined #archiveteam-bs
[18:15] *** odemg has quit IRC (Remote host closed the connection)
[18:19] *** odemg has joined #archiveteam-bs
[18:19] *** odemg has quit IRC (Connection closed)
[18:21] *** schbirid has joined #archiveteam-bs
[18:26] *** ndiddy has joined #archiveteam-bs
[18:28] *** odemg has joined #archiveteam-bs
[18:49] *** odemg has quit IRC (Remote host closed the connection)
[18:50] *** odemg has joined #archiveteam-bs
[18:57] *** bwn has quit IRC (Read error: Operation timed out)
[19:03] *** GE has quit IRC (Remote host closed the connection)
[19:05] *** j08nY has joined #archiveteam-bs
[19:06] *** odemg has quit IRC (Remote host closed the connection)
[19:08] *** bwn has joined #archiveteam-bs
[19:35] *** RichardG has quit IRC (Read error: Operation timed out)
[19:35] *** RichardG has joined #archiveteam-bs
[19:55] *** odemg has joined #archiveteam-bs
[20:11] *** pnJay has quit IRC (Quit: Page closed)
[20:24] *** RichardG has quit IRC (Read error: Operation timed out)
[20:24] *** RichardG has joined #archiveteam-bs
[20:24] JAA: https://github.com/JensRex/appnet-grab/blob/master/get-wget-lua.sh
[20:24] Added https and sha256sum.
[20:25] Testing and criticism welcome. Seemed to work alright for me.
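Going back to the Docker-on-a-Pi question from 17:31: a sketch only of what running the containerised warrior would look like, assuming the Dockerfile (or a published image) actually builds for ARM, which, as noted above, is the open question; the image tag below is arbitrary and the web-interface port is the warrior's usual 8001:

  # Build the image locally from the repository linked above:
  git clone https://github.com/ArchiveTeam/warrior-dockerfile.git
  cd warrior-dockerfile
  docker build -t archiveteam-warrior .

  # Run it and expose the warrior web interface:
  docker run -d --name warrior -p 8001:8001 archiveteam-warrior
  # then browse to http://<pi-address>:8001/ to pick a project and concurrency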
[20:28] diff: https://github.com/JensRex/appnet-grab/commit/3f855ce82844394a450cdfe8c00d81c65881a810#diff-90c27267aefa9ffc6a3afe742d8255fb
[20:42] *** odemg has quit IRC (Remote host closed the connection)
[20:44] *** pnJay has joined #archiveteam-bs
[20:50] *** REiN^ has quit IRC (Max SendQ exceeded)
[20:50] *** REiN^ has joined #archiveteam-bs
[20:57] *** LastNinja has quit IRC (Remote host closed the connection)
[20:57] Thanks JensRex. That runs tar and the check in parallel, right? I'd do that in separate steps, checking the sha256 hash first to ensure that only valid data is handled by tar. I'm not sure right now whether that's possible without a temporary file though.
[20:58] I was trying to avoid temp files
[20:58] Actually... I was trying to preserve the way the original did it, to be more accurate.
[21:01] *** GE has joined #archiveteam-bs
[21:05] The warrior VMs have ancient versions of curl/wget/OpenSSL that don't support the "Subject Alternative Names" certificate field, so they won't connect to https://warriorhq.archiveteam.org/
[21:05] Will that be a problem?
[21:05] Yeah, I can certainly see the benefit in that. But I can also think of several attack scenarios if tar is executed on a potentially invalid file, e.g. specially crafted pathnames to get tar to overwrite other files, or simply a security vulnerability. So the hash should definitely be checked before calling tar if this is implemented.
[21:06] Ah, I was about to ask that since I noticed the "https" in the diff
[21:06] That makes sense
[21:06] But yeah, downloading over HTTP, checking the SHA256 hash, then untarring should work and be secure
[21:07] Until SHA256 is cracked, that is
[21:08] The TLS config on warriorhq.archiveteam.org could also use a bit of tweaking. Right now it's using a common 1024-bit DHE prime, which is potentially vulnerable to the Logjam attack, and it's not enforcing a secure cipher ordering (see https://www.ssllabs.com/ssltest/analyze.html?d=warriorhq.archiveteam.org and https://wiki.mozilla.org/Security/Server_Side_TLS)
[21:09] *** LastNinja has joined #archiveteam-bs
[21:09] Since we're talking about TLS right now: why doesn't archiveteam.org use it (connections are possible, but a certificate for a different domain is supplied), and why doesn't tracker communication happen over HTTPS?
[21:09] The latter could be related to the Warrior problem you just mentioned
[21:09] When it's been brought up before, the answer is basically that we run on a volunteer effort and nobody has put in the work
[21:09] I see
[21:10] Setting up TLS isn't very complicated nowadays though, in my opinion
[21:10] Especially since Let's Encrypt went online
[21:10] (Which is used for the tracker.archiveteam.org certificate, by the way)
[21:12] JAA: Good points about tar.
[21:12] I'll rework the script to use temp files.
[21:12] Once I've slept.
[21:12] Haven't really slept in 3 days.
[21:13] Also, warriors are embarrassingly outdated.
[21:13] Hmm... I just checked the wget-lua built on the warrior VMs and they are using the same OpenSSL version as the included wget/curl, which means they can't do TLS 1.1 and 1.2
[21:13] It should be possible without temp files, but certainly not easily and possibly with a performance impact
[21:13] But I think they use the precompiled 'wget-lua-warrior' from the git repos.
[21:14] But I'm not sure. I don't use the warrior (any more).
[21:14] How are warriors updated, by the way? Security vulnerabilities etc.
[21:14] They're not.
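A minimal sketch of the separate-steps rework being discussed above (check the hash first, only then let tar touch the data), assuming a temporary file is acceptable; the URL and hash below are placeholders, not the real wget-lua values:

  WGET_LUA_URL="https://example.org/wget-lua.tar.bz2"   # placeholder
  WGET_LUA_SHA256="0000000000000000000000000000000000000000000000000000000000000000"   # placeholder

  tmpfile=$(mktemp) || exit 1
  trap 'rm -f "$tmpfile"' EXIT

  curl -fsSL "$WGET_LUA_URL" -o "$tmpfile" || exit 1

  # Verify the download before tar ever sees it:
  echo "$WGET_LUA_SHA256  $tmpfile" | sha256sum -c - || {
      echo "Checksum mismatch, refusing to extract." >&2
      exit 1
  }

  tar -xjf "$tmpfile"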
[21:14] Oh dear
[21:14] They use EOL Debian 6.
[21:14] With broken links in /etc/apt/sources.list
[21:15] Right now virtually every site that supports HTTPS supports TLS 1.0, but that number will probably start dropping over the next 5 years, since virtually all browsers can do TLS 1.2 and 1.3 is months away with huge performance gains
[21:15] https://www.trustworthyinternet.org/ssl-pulse/
[21:16] Actually, looking at the data on that site above, TLS 1.0 has already fallen from like 98% two years ago to 95% today
[21:17] So the warriors receive no software updates *at all*?
[21:17] Not really*
[21:17] (*) except for Python 3.5
[21:18] :-|
[21:18] I don't even want to know how many kernel vulnerabilities have been discovered in the meantime
[21:19] Has a "Debian stable + unattended-upgrades"-like setup been considered?
[21:19] I don't know. I'm a relative newcomer to AT.
[21:20] Yeah. When Debian Stretch comes out I'll look into what it would take to build a new VM based on that (with automatic updates, etc.)
[21:20] *** BlueMaxim has joined #archiveteam-bs
[21:20] Yea, I was thinking rebuilding the warrior on Debian 9 would be a good idea.
[21:20] Agreed
[21:21] *** kristian_ has joined #archiveteam-bs
[21:22] I've been toying with building warrior VMs, but laziness has won.
[21:22] As far as building something fit for release, anyway.
[21:22] I think unattended-upgrades also takes care of rebooting when necessary. May need some tweaking regarding stopping the jobs first etc. (I've never actually used it myself as I prefer to know when to expect my stuff to potentially break.)
[21:23] Yeah, that will be the tricky part
[21:23] I saw that the current warrior was built using Debian installer scripting. When I tried looking into that, my brain core dumped.
[21:23] I haven't looked at that. And now I certainly won't let that software anywhere near my machines :P
[21:25] But yeah, I was going to say "that needs to be fixed ASAP", but it's probably not a bad idea to postpone it until the release of Stretch
[21:25] I've mainly used Gentoo for many years. It's only somewhat recently I've become familiar with Debian, since Digital Ocean doesn't have Gentoo.
[21:25] As that will probably happen sometime next month or so
[21:37] Question: are yahoo_answers downloads supposed to last over an hour?
[21:38] Matt_Lock: Possibly. They are quite big.
[21:38] But still. Over 4000 items and counting? Is that normal?
[21:38] Yep, had several over 5k
[21:38] ditto
[21:39] currently 2 of mine are at 3934 and 4127
[21:39] Got it. Thanks.
[21:46] note to self: write a wiki page for TELNIC
[21:47] *** dashcloud has quit IRC (Remote host closed the connection)
[21:51] *** dashcloud has joined #archiveteam-bs
[21:53] *** odemg has joined #archiveteam-bs
[22:13] *** JAA has quit IRC (Quit: Page closed)
[22:19] *** namespace has quit IRC (Ping timeout: 370 seconds)
[22:40] *** RichardG has quit IRC (Read error: Operation timed out)
[22:40] *** RichardG has joined #archiveteam-bs
[22:45] *** odemg has quit IRC (Remote host closed the connection)
[23:01] *** pnJay has quit IRC (Read error: Connection reset by peer)
[23:28] *** RichardG has quit IRC (Read error: Operation timed out)
[23:28] *** RichardG has joined #archiveteam-bs
[23:34] *** kristian_ has quit IRC (Quit: Leaving)
[23:56] *** GE has quit IRC (Remote host closed the connection)
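Finally, a sketch of the "Debian stable + unattended-upgrades" idea from 21:19 as it might apply to a rebuilt warrior VM; the package and option names are the standard Debian ones, the drop-in file name is arbitrary, and cleanly stopping warrior jobs before a reboot would still need project-specific work, as noted above:

  # As root on the rebuilt VM:
  apt-get install unattended-upgrades
  dpkg-reconfigure -plow unattended-upgrades   # writes /etc/apt/apt.conf.d/20auto-upgrades

  # Optionally let the VM reboot itself when an update requires it
  # (drop-in file name chosen for this example):
  printf '%s\n' \
      'Unattended-Upgrade::Automatic-Reboot "true";' \
      'Unattended-Upgrade::Automatic-Reboot-Time "04:00";' \
      > /etc/apt/apt.conf.d/52warrior-autoreboot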