[00:06] *** dashcloud has quit IRC (Read error: Operation timed out) [00:09] *** dashcloud has joined #archiveteam-bs [00:15] *** FalconK has quit IRC (Ping timeout: 260 seconds) [00:16] *** FalconK has joined #archiveteam-bs [00:50] *** sigkell has quit IRC (Ping timeout: 260 seconds) [00:58] am 19, can confirm that disk problems sound worrisome [00:59] *** Stiletto has quit IRC (Read error: Operation timed out) [01:00] *** sigkell has joined #archiveteam-bs [01:00] snicker [01:14] oh nice, https://github.com/ptal/expected is being proposed with Haskell do-ish notation for C++ [01:17] so you can write x <- failure_or_result(); y <- failure_or_result(); return op(x, y); and the compiler will generate appropriate things to propagate errors [01:17] *** Stiletto has joined #archiveteam-bs [01:17] this has code concision benefits, but mostly I like it because it means I can use more PragmataPro ligatures [01:45] *** Atros has quit IRC (Read error: Operation timed out) [01:45] *** atrocity has joined #archiveteam-bs [01:59] and as usual, i have a lot more upload bandwidth than download [01:59] i'll never understand this [02:01] 62 down, 68 up, but i'm rated at 50/50 [02:01] not complaining, just weird that up is so much faster [02:04] *** dashcloud has quit IRC (Read error: Operation timed out) [02:05] *** SN4T14 has quit IRC (Read error: Connection reset by peer) [02:08] *** dashcloud has joined #archiveteam-bs [03:16] By the way, for the record, I've been downloading from that site everyone got banned from (bootlegs) for 4 days now, no ban. [03:16] Why? Because I didn't attack a long-standing site with 10 parallel hits, that's why [03:17] Also, on FOS, I've now initiated 6 separate uploading processes. (Creating the MegaWARCs, uploading) [03:18] Eventually, I'm making a new uploader script that is centralized, puts the megawarc completions on a site with reverse viewing, and that notifies IA of the work, etc. [03:18] I just need to sit down to do all that. [03:18] *** dashcloud has quit IRC (Read error: Operation timed out) [03:19] *** bwn has quit IRC (Ping timeout: 492 seconds) [03:20] *** BnA-Rob1n has quit IRC (Ping timeout: 244 seconds) [03:22] *** dashcloud has joined #archiveteam-bs [03:22] *** BnA-Rob1n has joined #archiveteam-bs [03:22] *** Simpbrai_ has quit IRC (Ping timeout: 244 seconds) [03:22] *** Simpbrai_ has joined #archiveteam-bs [03:40] Spent the weekend planning out the Japan trip. Japan trip will be nuts. I'm essentially off the grid. [03:41] If I come back and there's a coup, I'm getting on my horse with a samurai sword and you should all inform your relatives you should be considered dead. [03:41] Other than that, it'll be a good time. (End of May to end of June) [03:42] Why are you going to Japan, out of curiosity? [03:42] Nice! Have fun in Japan, SketchCow [03:43] (Yes, I realize it's quite a ways away still) [04:00] http://archiveteam.org/index.php?title=Internet_Archive/Collections -- list of all the IA collections that contain other collections (at least, based on the recheck of the identifiers provided back in March 2015) [04:02] it occurred to me that I have not seen a chibi Jason Scott [04:04] *** Sk1d has quit IRC (Ping timeout: 250 seconds) [04:12] kawaii~ [04:12] *** bwn has joined #archiveteam-bs [04:13] *** Sk1d has joined #archiveteam-bs [04:43] *** metalcamp has joined #archiveteam-bs [04:48] *** tomwsmf-a has quit IRC (Ping timeout: 258 seconds) [04:56] the latest XKCD has me rolling on the floor by the end of it. it all accelerated so smoothly. 
[05:05] *** metalcamp has quit IRC (Ping timeout: 244 seconds) [05:10] *** Stilett0 has joined #archiveteam-bs [05:11] *** beardicus has quit IRC (Read error: Operation timed out) [05:12] *** Stiletto has quit IRC (Read error: Operation timed out) [05:12] *** schbirid has joined #archiveteam-bs [05:13] *** Jonimus has quit IRC (Read error: Operation timed out) [05:16] *** Mayonaise has quit IRC (Read error: Operation timed out) [05:17] *** Honno has joined #archiveteam-bs [05:35] wyatt8740: Haha wow, I need to catch up on xkcd apparently [05:37] https://archive.org/stream/the-patch-19xx/the_patch.19xx#page/n0/mode/1up <- THE PATCH! "Wouldn't you like to access YOUR entire typeset?" [05:44] *** beardicus has joined #archiveteam-bs [05:44] *** Mayonaise has joined #archiveteam-bs [06:32] *** metalcamp has joined #archiveteam-bs [06:44] *** Jonimus has joined #archiveteam-bs [06:44] *** swebb sets mode: +o Jonimus [06:53] *** JesseW has quit IRC (Ping timeout: 370 seconds) [07:20] *** Honno has quit IRC (Read error: Operation timed out) [07:31] *** bwn has quit IRC (Read error: Operation timed out) [07:59] *** metalcamp has quit IRC (Ping timeout: 244 seconds) [08:05] *** metalcamp has joined #archiveteam-bs [08:07] *** zenguy has quit IRC (Read error: Operation timed out) [08:11] *** bwn has joined #archiveteam-bs [08:46] *** dashcloud has quit IRC (Read error: Operation timed out) [08:49] *** dashcloud has joined #archiveteam-bs [09:33] *** lbft has quit IRC (Quit: Bye) [09:34] *** lbft has joined #archiveteam-bs [09:43] *** lbft has quit IRC (Quit: Bye) [09:47] *** metalcamp has quit IRC (Ping timeout: 244 seconds) [10:55] SketchCow: isn't going to japan and being off the grid like going to a LEGO convention and playing with Megablox? [11:51] *** metalcamp has joined #archiveteam-bs [12:37] *** BlueMaxim has quit IRC (Read error: Operation timed out) [13:04] *** SN4T14 has joined #archiveteam-bs [13:27] *** GLaDOS has quit IRC (Ping timeout: 260 seconds) [13:30] *** GLaDOS has joined #archiveteam-bs [13:33] *** VADemon has joined #archiveteam-bs [13:49] *** Stilett0 is now known as Stiletto [13:56] *** Honno has joined #archiveteam-bs [13:59] *** Start has quit IRC (Quit: Disconnected.) [14:29] *** Yoshimura has joined #archiveteam-bs [14:31] *** Kaz has joined #archiveteam-bs [14:32] *** kurt has quit IRC (Quit: leaving) [14:48] http://www.storybench.org/to-scrape-or-not-to-scrape-the-technical-and-ethical-challenges-of-collecting-data-off-the-web/ [14:49] EVERYONE NOT FOLLOW THAT PLEASE [14:49] I suspect I will check in, in Japan, but my time will equally be spent walking, going to events, and generally not being able to react to anything online. [14:54] *** Start has joined #archiveteam-bs [14:58] I don't get why not to scrape data when they have API [14:58] Or if it excludes the API [15:21] *** JesseW has joined #archiveteam-bs [15:57] *** JesseW has quit IRC (Ping timeout: 370 seconds) [16:05] *** Stiletto has quit IRC (Ping timeout: 260 seconds) [16:06] *** Start has quit IRC (Quit: Disconnected.) [16:43] SketchCow: from when to when are you not available? [16:43] err nevermind [16:43] "End of May to end of June" [16:49] we got time to plan the coup... [16:52] * arkiver muhahaha [16:59] never thought I'd say this but I'm so happy my 2tb harddrive came with free one day delievery [16:59] have to get rid of my optical drive to use it tho [17:00] fortunely, who uses disks anymore [17:02] *** Stiletto has joined #archiveteam-bs [17:04] I do use discs. 
The only externally connected thing is optical drive. [17:04] Optical media are kind of superior. Also if you are archiving old stuff, you need to create images. [17:04] probably shouldn't of said that in an archivist irc haha [17:05] * HCross2 grabs hammer of justice [17:05] oh do you guys find that, huh [17:07] * yipdw_ has had optical media fail in cold storage and these days just keeps stuff powered on disks [17:07] but I fear I have incited a nerd riot by stating that, so I will just pass that off as anecdotal and exit [17:08] JesseW: "THE PATCH (tm)" [17:08] eh, i keep things on spinning rust too [17:08] I only really use optical media for playing sega CD software [17:08] of which there's barely any good stuff [17:09] I have all my stuff on spinning rust [17:09] :\ [17:09] damnit I should have followed the topic advice [17:09] :P [17:10] another sign that optical media is dying http://i.imgur.com/rCROKbJ.jpg [17:10] also taiyo yuden's partnership with JVC for CD-R's stopped this january [17:29] *** Honno has quit IRC (Read error: Operation timed out) [17:37] I wonder how Archive.org stores the data. Anyone knows? [17:38] yipdw_: Both ways of storage have its merits [17:38] I fucking knew it [17:38] * yipdw_ mute [17:38] unmute [17:41] Yoshimura: live HDDs for digital data [17:41] also physical books etc [17:41] joepie91: No, I meant the actual ondisk format. [17:42] If they compress blocks, or if they compress each page separately, or if they use large warcs and parse it. [17:42] IT'S WARC [17:42] Warc is just format, their ondisc storage might be different. [17:43] nope [17:43] listen to me [17:43] it's warc [17:43] So they store gziped warc files? [17:43] did i say something confusing? [17:43] How large? And if one wants one page, they decompress whole large warc? [17:44] how large depends on how big the file from the crawler is [17:44] they use a different gzip stream per file with a cdx to tell them where to seek and start decompressing [17:44] So they just use index file for web and then decompress huge files to get one page? [17:44] they use a different gzip stream per file with a cdx to tell them where to seek and start decompressing [17:45] They did talk about special very efficient storage, and this sounds like dumb idea. [17:45] .warc.gz is a concatenation of gzipped warc records; you don't need to decompress everything [17:45] ok [17:45] you seek to the offset and decompress just that [17:45] Does not make sense what you said, but ok. [17:45] that's ok it doesn't have to make sense [17:45] it works [17:46] It does. [17:46] What would make sense is that its gzipfile, with per warc record blocks. [17:46] yep, and we're glad that it does work :) [17:46] I meant it does have to make sense xD [17:46] it does, and I suspect we're talking about the same thing [17:46] No, gzip has headers. [17:46] please analyze the structure of these warc.gz files generated by Archive Team [17:47] So concatenation would mean each record has gzip file header. [17:47] yeah [17:47] they do [17:47] And gzip can also decompress it as just one file? [17:47] yes, or as individual records [17:47] I have to then take a second look at gzip format. [17:48] Ok, thanks that helped a lot. Still do not see what is special very efficient storage on this. [17:48] I don't know who told you that it is special or very efficient [17:48] Compressing together multiple files, or even using xdelta for historical versions, or a common dictionary would make it work much much better. 
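The seek-and-decompress access pattern being described is short enough to sketch: every WARC record is its own gzip member, and the CDX index stores the byte offset (and compressed length) of each member. A minimal illustration using Python's zlib; the path, offset, and length below are placeholders standing in for a real CDX lookup, not values from any actual index:

    import zlib

    def read_record(warc_gz_path, offset, length):
        # Seek to the start of one gzip member and decompress only that
        # member; nothing else in the (possibly huge) .warc.gz is touched.
        with open(warc_gz_path, "rb") as f:
            f.seek(offset)
            compressed = f.read(length)
        d = zlib.decompressobj(zlib.MAX_WBITS | 16)  # expect a gzip header
        return d.decompress(compressed) + d.flush()

    # hypothetical values; a real CDX line supplies the offset and length
    print(read_record("example.warc.gz", 1048576, 4096)[:200])

This is also why the concatenation still behaves as one big file: a gzip tool reading the whole stream just sees one member after another.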
[17:48] it's not hyper-efficient but it's good enough for now [17:49] yes, and it'd cause complications elsewhere [17:49] I read that on archive.org pages few months back. [17:49] please submit your ideas to the Internet Archive, their engineers will be happy to solicit feedback from the Internet [17:49] No complications if you use deflate with pre-seed for dictionary. [17:49] the Petabox hardware is (was?) novel, if that's what you mean [17:50] *** Start has joined #archiveteam-bs [17:50] but that's still an assemblage of off-the-shelf parts, arranged to reduce administrator overhead [17:50] sure, any competent engineer could squeeze out another 10%. but when you put the same data in less space you add fragility [17:50] fragility is not a goal of the internet archive [17:50] xmc: I am talking about a lot more then 10% while introducing next to no fragility, if you are concerned with that. [17:51] why are you seeking to argue with me over this? [17:51] you asked how it works, i explained how it works [17:51] neither of us are in a position to change anything [17:51] xmc: I am not, sorry it it looks like so. [17:51] re fucking lax [17:52] well at least nobody mentioned blockchains [17:52] the world is not up to your standards and you're just going to have to be ok with it [17:52] if you want change you should work at the archive and push it through [17:52] i'm sure they'd be happy to have more competent engineers [17:52] *** bwn has quit IRC (Ping timeout: 246 seconds) [17:53] It does make sense if you have enough space to store it in simple format, yes. Still deflate with preseed dictionary would make it lot more (for html, or even images (headers)...) efficient adding next to no change, as zlib that is the standard for gzip, is standard and deflate is just gzip withotu header. [17:53] I am from other continent, so working there could be a problem, else would gladly do that. [17:54] yakfish: Maybe it was written in way that is confused soft and hard parts. Not sure [17:55] a preseed dictionary might improve compression ratio, but it would also complicate access [17:55] if you lose the dictionary -- whether it be a separate file, or a separate part of the stream -- you're in trouble [17:55] one advantage of concatenated gzipped warc records is that you have many ways to access that stream, and you can recover parts of it without much hassle [17:56] so you need to balance that in too. accessibility is a major concern alongside storage efficiency [17:57] yipdw_: Well I meant same way, per record deflate, of course. [17:58] if you can benchmark this and demonstrate significant gains on typical datasets that would be interesting [17:58] bonus points if existing replay tools can use it with no changes [17:58] ok, good idea. [17:58] IDK what are replay tools. [17:58] the other half of the equation that makes everything we do actually useful [17:58] Wayback Machine, pywb, webarchiveplayer are some examples [17:59] And dictionary would be a standard thing that should not be lost. If the warc was compressed in way, that only body of webpage would be gzipped in final format, so you could just stream that chunk to browser it would make sense. [17:59] please do investigate [18:00] But if its whole warc and you do not replay whole headers, means you still need to decompress, and that would make other (potentially time expensive, but very efficient (xdelta)) compression much better. 
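For what it's worth, the per-record "deflate with a preset dictionary" idea maps directly onto zlib's zdict parameter, so a sketch is short. The dictionary bytes below are an invented illustration (a real one would be curated from a corpus of headers and markup, as discussed), and the identical dictionary has to stay available for as long as anyone wants to read the records, which is exactly the fragility trade-off raised here:

    import zlib

    # Toy preset dictionary: common substrings pre-loaded into the deflate
    # sliding window. Purely illustrative, not a real curated dictionary.
    DICTIONARY = (b"HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=utf-8\r\n"
                  b"<!DOCTYPE html><html><head>")

    def deflate_record(record):
        c = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS, zdict=DICTIONARY)
        return c.compress(record) + c.flush()

    def inflate_record(blob):
        # Needs exactly the same DICTIONARY bytes at read time.
        d = zlib.decompressobj(-zlib.MAX_WBITS, zdict=DICTIONARY)
        return d.decompress(blob) + d.flush()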
[18:00] But if you still want to keep speed plus almost total compatibility (same library, different calls) then deflate would be the way. I will. [18:01] total compatibility [18:01] Total is nto doable. [18:01] almost will make for an interesting experiment, but it's a barrier to accessibility [18:01] Not much. Needs few lines of code change. [18:02] please check out http://archiveteam.org/index.php?title=The_WARC_Ecosystem [18:03] maintaining compatibility amongst these tools is a lot more important than saving a dozen gigabytes in a 50 GB fil [18:03] e [18:03] Building dictionary is though work. Needs either software that can analyze it well (lcs tree with statistical outputs along building the tree, or similar) or a lot of manual labor. [18:03] Okay. Will check it, but if someone wants 100% compatibility and there is not other way, then my investigation would be worthless. (Not to me though, so I might do it anyway) [18:04] Wget dedup is broken btw. [18:05] *** Honno has joined #archiveteam-bs [18:05] https://www.opposuits.com/pac-man-suit.html [18:05] hawt [18:07] https://github.com/ArchiveTeam/archiveteam-megawarc-factory#2-the-packer [18:08] Packing involves lots of gzipping and takes some time. [18:08] If they are just concatenated, why it involves lot of gzipping? [18:09] all your questions can be answered by reading this https://github.com/ArchiveTeam/megawarc/blob/f77638dbf7d0c4a7dd301217ee04fbc6a3c3ebbf/megawarc [18:09] *** Start has quit IRC (Quit: Disconnected.) [18:14] All found is that it tests the gzip files and uses gzip for that. [18:14] Which is honestly retarded. [18:15] As it has to spawn a lot of processes, and the warc files are tiny, so I can see most time spent on spawning the processes then on actual tests. [18:17] All the other gzip operations are handled via python via zlib and they are only used for json metadata. So the README.md could be in false statement or misleading. [18:17] *** Honno_ has joined #archiveteam-bs [18:18] Not sure about capability of python, but in best case it would take only single process and would reuse handle to the library for each warc record. [18:18] In the worse case, it would re-init the library handles. [18:20] And also while I dislike python personally, one has to ask himself why the whole megawarc-factory is not scripted, perhaps in python, using workers with thread safe operations for the queues. [18:21] *** Honno has quit IRC (Read error: Operation timed out) [18:22] *** Start has joined #archiveteam-bs [18:25] *** bwn has joined #archiveteam-bs [18:29] *** Honno has joined #archiveteam-bs [18:36] *** ndizzle has quit IRC (Read error: Connection reset by peer) [18:38] *** Honno_ has quit IRC (Read error: Operation timed out) [18:38] *** ndiddy has joined #archiveteam-bs [18:45] So after going through the info, I conclude that gzip is nice, while deflate with dictionary and a new tools (one, different languages) would add little complexity for potentially great benefits (to be determined experimentally with crappy amateur with insight dictionary). Those tools would be mere wrappers for zlib, just like gzip. Which only has [18:45] additional header. In this case plus dictionary, that would be elaborately selected over time by group of experts with input from amateurs over the world. So given in context of time, there could be only one universal dictionary for course of many, many years. Or optionally even language specific ones, or framework relative ones. ... 
Basically this [18:45] process would be same like HTTP/2, which does use predefined dictionary to compress HTTP headers and both save BW and improve latency at the same time. [18:46] *** RedType has left [18:48] External gzip is nice. It can be shadowed by pigz for multithreaded goodness. [18:49] zino: You fail to understand. [18:49] It does iterate over warc records, unless I misunderstood. [18:49] So pigz has no place there, also its made to test the file for errors. So I would only use gzip or zlib. [18:50] If space was concern over CPU, xdelta or per-line diff could be incorporated, having per-site specific dictionaries. Which could be made HTTP/1.1 SDCH extension compatible. ... This could be a great thing with great potential for personal or small scale archiving. But if someone does not care about space or money much, the deflate would be enough. [18:50] In time though CPU power is getting cheaper, while storage goes marginally down. [18:53] Oh, I missed the upper part of your wall of text. I missed that you where talking academic advantages, not current engineering optimizations. [18:54] arkiver: Asking you here xD [18:54] https://github.com/ArchiveTeam/fotolog-grab/blob/master/fotolog.lua#L122-L126 [18:54] just some little check [18:54] but it works [18:54] zino: Actually was talking in context of megawarc tool, which in my opinion is not bad though not great. Megawarc factory is another story though. [18:55] do you want to improve megawarc? then make it better! [18:55] patches are gladly accepted [18:55] xmc: I would rewrite it xD [18:55] ok [18:55] do it [18:55] Also not python, which might be... against some people. [18:56] i promise if it works just as well & is faster, then it'll be used [18:56] also if it doesn't require six hours of sysadmin work to put into place [18:56] But might try to make a patch for the gzip madness. Which is just what I talked all day today. Replacing gzip binary with zlib gzip to test the files. [18:57] I guess I have to download some megawarc to test it then xD [18:58] The only way I am ok with python is its widespread use, other people knowing it and it does the job. I do not dislike Python, but its syntax. xmc You are in charge / can have any input from you how long it takes to process one megawarc (packing) ? [18:59] if you're not going to use python then what do you have in mind? [18:59] Anything that works. C, C++, Ruby, other compiled language. But if current tools work, though ugly, I have other better things to work on honestly. [19:00] use python if you can [19:00] for consistency [19:01] ruby isn't compiled [19:01] From functioning perspective only patch to use zlib via binding and not binary for testing the warc records would solve the major culprit. No sysadmin (yep) is keen to see tons of processes. [19:01] Yes, I am aware its not compiled. xmc can you provide or from someone how long it takes to pack one megawarc? [19:02] * xmc shrugs [19:02] a while [19:02] *** JW_work has joined #archiveteam-bs [19:02] i don't run the machine where that happens [19:03] but i'd guess about half an hour for a 50G megawarc [19:03] Yeah, a day or two wait is fine. [19:03] Does the half hour exclude upload? [19:03] If yes, its insane. [19:03] xmc: yes, but do you want to "be in the machine where it happens" (hums the song) [19:04] Yoshimura: You might be interested in IPFS, if you haven't heard of it. They do some of the creative storage ideas you have been mentioning, IIRC. [19:05] Holy molly, looks nice. 
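For reference, the "replace the external gzip binary with zlib for the integrity test" idea amounts to something like the sketch below: walk the concatenated gzip members in-process and confirm each one decompresses cleanly, with no subprocess per record. This is only a sketch of the approach, not megawarc's actual code, and a real tool would stream in chunks instead of reading the whole file at once:

    import zlib

    def gz_members_ok(path):
        with open(path, "rb") as f:
            data = f.read()  # sketch only; stream in chunks for 50 GB files
        pos = 0
        while pos < len(data):
            d = zlib.decompressobj(zlib.MAX_WBITS | 16)  # one gzip member
            try:
                d.decompress(data[pos:])
                d.flush()
            except zlib.error:
                return False
            if not d.eof:
                return False  # truncated final member
            pos = len(data) - len(d.unused_data)
        return True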
[19:05] yeah IPFS is a pretty impressive piece of work [19:05] packing one megawarc depends primarily on the I/O bandwidth you can throw at it; other factors like compressibility can also influence time [19:06] e.g. records containing video can take more time [19:06] all of which is a way of saying "it's variable and I don't care because I just start the process and let it go" [19:06] I will be honest now. Although I seen a lot of stuff already, I had same or (much) greater ideas then implemented today, but 8 years ago. [19:07] that's nice — are they ready for us to drop in and use them? If not, thanks but no thanks. [19:07] Have to look more into this to talk. But one great example of a storage system that conforms or approaches quality of my designs (I am lone person, never studied CS) is Cassandra [19:08] some people around here use Cassandra as a data-store for their projects, yes [19:08] we don't require it because we value the ability to get stuff up and running on some random Linux machine [19:08] and Cassandra, despite its merits, incurs operational complexity [19:09] yipdw_: even I/O could be leveraged. Look, if you run more then one worker on HDD platter you are screwing I/O. [19:09] archiveteam values: simplicity, reproducibility, completeness, portability [19:09] Yoshimura: yeah we know this, that's why some packers use SSDs [19:09] Yoshimura: can I please ask you to trust that we aren't idiots [19:09] no [19:09] you can't [19:09] That to me sounds "we know its shit so we throw more hardware at it" [19:09] we are apparently incompetent and crazy [19:10] Sounds like "I want to kill you to sysadmins" [19:10] not really; the system we use was loaned to use by a sysadmin who had SSDs laying around [19:10] I do trust you more then most people I met on the net. I think (or try) objectively, Which is how I get to a lot of my designs. [19:11] what I want to communicate is that if you want to rewrite things, that's fine. however these tools are in the state that they are primarily because they work, they have known behaviors, and they aren't really that broken [19:11] in cases we do consider removing bits when it's clear that we have hit their limits [19:11] it's not the most efficient but it's sturdy [19:11] this is one reason why wpull exists, for example [19:12] But whoever wrote the test_gz routine was not that competent. [19:12] ok fine [19:12] please patch it [19:12] Yoshimura: just so you know, you are wearing out the patience of your audience. Actual running code, especially with stats showing it works as well or better than alternatives — yes, please. Random claims about your designs, or the competence of other people — no, thanks. [19:13] I meant that with absolute respect, noone is competent enough in evrything. And random claim was offtopic note, sorry, about your patience. [19:13] I took it as educational discussion, not complaint or rant. [19:14] This is my (major?) problem, different way of understanding thing, sorry about that and expect that please in my case. [19:15] well [19:15] stop insulting people [19:15] with specific regard to why megawarc does what it does, I don't know. if you'd like to remove the gzip step and make it more akin to cat, then please do so [19:15] The problem is, some stuff does not sound like insult to me. [19:15] if it generates valid output then it would probably give us a nice speed boost [19:15] Borp [19:15] Yoshimura: accept it & move on [19:15] What the hell is going on in here [19:15] I do accept it, I moved on. 
[19:16] lol [19:16] So am I seeing another great, classic case of "YO HO EVERYBODY CALM DOWN, I AM HERE, I HAVE A NEW PARADIGM AND YOU PLEBES ARE GONNA LEARN JUST STEP BACK" [19:16] Or are we seeing helpful advise from a brave new warrior. [19:17] no I think what we're seeing is just a language thing that we'll all work out soon [19:17] Got it. [19:17] My least favorite thing is when someone wants to rewrite X in Y [19:17] Which is good or I'd make all you shits rewrite everything in bash [19:17] #!/bin/sh forever [19:18] let's rewrite wpull in haskell [19:18] it seems simultaneously appropriate and inappropriate to point out that /bin/sh isn't bahs [19:18] but enough of that [19:18] * xmc laughs [19:18] bash? Bash?? What about ed? [19:18] * SketchCow rewrites yipdw_ in bash [19:18] baahaaahaaa [19:18] * yipdw_ is shellshocked [19:19] nah, rewrite it all in FORTRAN [19:19] * JW_work heard of a terrible idea to rewrite a whole system purely in a custom SQL dialect once… [19:19] you know FORTRAN kinda means "runaway" in german [19:21] SketchCow: The part "trying to show you, not only write it, how it works bad and better" is indirectly true. Yes, I am new to warrior, knew about archiving a while and already done work on myself. Sorry to sound like annoying know it all. Having to work alone does vane off social comm skills, technical remain. [19:21] Stop with the FORTRAN! It's anough that I have to deal with it on payed time. [19:21] enough* [19:22] Yoshimura: no problem. honestly if you can speed up megawarc I would be happy to try it out [19:22] to be honest, the bottleneck isn't really compression [19:22] Can you propose a download link to HTML file to test on? [19:22] it's upload to IA [19:23] however if the speed up is due to reduced I/O then that can be nice alsoo [19:23] Yeah, I am aware. Thats why I would like input or test on my own before I spend time rewrting it whole. [19:23] The test_gz is done fine. Problem is handling a lot of files in non-sequential manner. [19:24] I'll point out gamefront files are 15gb apiece up on archive [19:24] yes, we have many megawarcs in the archiveteam collection, e.g. [19:24] https://archive.org/download/archiveteam_hyves_20131120141647 [19:24] That's your smallest. [19:24] gamefront is good too [19:24] e.g. https://archive.org/download/archiveteam_gamefront_20151112045523 [19:25] to demonstrate speed increases, I suggest unpacking the megawarc into its components (individual WARCs and extra files); the megawarc program can do this [19:25] then repacking them with your proposed algorithm [19:25] Yes, that is my plan. Will try, will do, and will come back with it, or shut up about this single problem. [19:26] I'd also be interested in whether I/O load goes down [19:27] So [19:27] http://fos.textfiles.com/ARCHIVETEAM/ [19:27] I have to more thoroughly analyze the script then can tell more. io might not go down in bw, but might get converted to more sequential at cost of re-reading a file, which on hdd platter should be faster. [19:28] Yay! cheers [19:28] Also, shout out to the three of you guys who messaged me WAIT NO DO NOT OBLITERATE THE NEWBIE [19:29] Thanks ^^ [19:29] As long as you realize there's 5 years of good decisions behind the current setup. [19:30] With smart people making good passes, and divesting themselves of choices for reasons between efficiency and logic. [19:30] That's why I care. Stuff that is done poorly usually have people that do not accept constructive improvements. 
[19:30] That's why I care that when you use phrases like "poorly" and "competent", you're being Linus without the cachet [19:31] Or ... [19:31] * SketchCow looks around [19:31] ...Theo [19:31] * zino hides [19:31] * SketchCow hears thunder [19:31] theondr [19:32] obliterating the newbie is something I'm trying to purge from me [19:33] that and shitting on software [19:33] it's hard [19:33] or shitting on the author I guess [19:33] So http://fos.textfiles.com/ARCHIVETEAM/ is the vanguard. [19:33] oh cool [19:33] Now I have one script that packs up the items and then hands it to a general script. [19:33] It used to be three, one in each folder. [19:33] Some social stuff is hard for different people. I grew up with tech, nature, not people. [19:34] #!/bin/sh [19:34] # SPLATTER - Pack up the content and place it up. Needs Upchuck to work. [19:34] # EDIT THE SETTINGS TO MAKE SURE YOU GO TO THE RIGHT PLACE. [19:34] TITLE="FTP Site Download" [19:34] CLASSIFIER="ftp" [19:34] COLLECTION="archiveteam_ftp" [19:34] echo "We're putting this into $COLLECTION." [19:34] each=$1 ITEMNAME=${CLASSIFIER}_$each [19:34] echo "Going down the collection linearly." [19:34] mkdir DONE [19:34] for each in 2* do ionice -c 3 -n 0 python /0/SCRIPTCITY/megawarc --verbose pack ${CLASSIFIER}_$each $each [19:34] mv $each DONE [19:34] bash /0/SCRIPTCITY/upchuck "$COLLECTION" "$each" "${TITLE}" [19:34] mv "${CLASSIFIER}_${each}."* DONE [19:34] done [19:34] So, that's the whole script. Note the TITLE/CLASSIFIER/COLLECTION settings. You set everything there. [19:34] The rest does the work without me editing (chances for problems) [19:44] *** Start has quit IRC (Quit: Disconnected.) [19:47] It now forces me to make a collection for each set. [19:47] Which I should have been doing. [19:56] *** schbirid has quit IRC (Quit: Leaving) [20:04] I want to eventually find out what zino is up to. [20:04] http://eldrimner.lysator.liu.se:8080/archiveteam.txt [20:04] We're in the process of uploading [20:48] *** SimpBrain has quit IRC (Quit: Leaving) [20:51] Yoshimura: worth noting that IA / ArchiveTeam do not necessarily have the same requirement as $randomStartupThat'sGoingToGoBankruptInAYear [20:51] lol [20:51] Yoshimura: the most important aspect of storage, for example, isn't that it's efficient or fast or elegant or whatever. the most important aspect is that it's robust [20:51] that when you point somebody at it in 50 years [20:52] they can still trivially decode/read/whatever it [20:52] Good joke. [20:52] I consider "paper + nerd with a scanner" a perfectly acceptable form of long-term storage, fwiw :P [20:52] Yoshimura: making light of a serious explanation is not a great way to encourage discussion. [20:52] But I agree. 50years well not sure what formats will be used. [20:53] my point here is that something that may look 'suboptimal' isn't necessarily suboptimal, and even if it is, the benefits of 'fixing' it do not necessarily outweigh the drawbacks [20:53] Papers degrade, nerds and scanners also. [20:53] Yoshimura: yes, but we still have paper docs from a very long time ago. [20:54] papers degrade quite a bit slower than basically every other automatable form of storage we currently have available to us [20:54] ^ [20:54] but it's still worth *investigating* possible fixes — we can find out why they aren't a bad idea *after* they are available to be tried :-) [20:54] er, "are a bad idea" I meant. 
[20:54] and as much as paper degrades, the pile of dead WD Blues I currently have my feet resting on would like to have a word with you ;) [20:54] JW_work: sure, it's more the attitude I'm concerned with than the suggestions [20:54] What I gained from my life that the only way to ensure time longetivity is cahnging the format to one currently supported in that time period [20:54] Yoshimura: s/supported/accessible. [20:54] JW_work: I'm all for suggestions for improvement, but people have to stay realistic, and understand why there isn't a channel immediately jumping on a shiny-sounding idea to 'do everything better' [20:55] So warc itself is fine, way to store warcs do and will change. [20:55] and why insisting on how much 'better' it is is just going to be counterproductive, as compared to writing out a proof of concept [20:55] and a well-reasoned analysis of the benefits and drawbacks [20:55] archival isn't anything like any other kinds of data storage you'll come across [20:56] joepie91: no argument from me [20:56] I stopped insisting anyone. [20:56] JW_work: right. this is what I've been trying to explain to Yoshimura ;) [20:56] And proofs are not written in chat, so if someone keeps the conversation in loop, like you do now, I just reply with arguments, if you oppose. [20:57] Yoshimura: it's perfectly acceptable to say "I'll get back to this later, let me write it out first" [20:57] Thats logical. So its not about "only throwing hands" but about not ignoring conversation. [20:57] but once you start making claims, don't be surprised when people push back against them [20:57] especially if said claims are made with an air of superiority [20:57] Yeah, that I agree. [20:58] if you have an idea for improvement, then it'd be greatly appreciated if you could write it out in detail, analyzing the benefits and drawbacks, metrics, etc. [20:58] there are definitely areas that could use improvement [20:58] Yoshimura: my advice is that if you have a POC, show us, or if you think you can have a POC then say "hey i'm gonna go and hack on this till it works vaguely" and then come back and show us [20:58] I'm just saying that an ad-hoc discussion here in channel like this is probably not the best medium for that :) [20:58] I do, but not for warc also is not using warc files [20:58] #archiveteam-bs-bs? ;) [20:59] This is the place for a discussion. [21:00] I did deflate with dictionary in 2008 used flat files. One for pages one for respose headers [21:00] So basically DIY warc. [21:01] Yoshimura: have you read the WARC (draft) spec? [21:01] Why so? [21:01] Never ever create custom WARCs [21:01] Yoshimura: no hidden point, just a general question :P [21:01] what's a "custom WARC"? [21:01] I did read stuff, not shole spec. [21:02] JW_work: I mean like stuffing some headers and data you found somewhere in a WARC [21:02] Yoshimura: right. 
so one thing to take into consideration, aside from recommending that you read the full spec, is that not all responses are necessarily HTTP [21:02] for example, Heritrix also writes DNS requests and responses to the WARC [21:02] WARC is more-or-less agnostic to what exact kind of data it contains [21:03] Yoshimura: there's a few subtleties like this, plus the unusual requirements of long-term archival, that make it somewhat difficult to design new/compatible implementations [21:03] Yoshimura: also, I -think- the index file implementation that IA uses is not part of the WARC spec [21:04] not sure what limitations that introduces [21:04] docs-wise, there's https://archive.org/web/researcher/cdx_file_format.php and https://archive.org/web/researcher/cdx_legend.php and some stuff that can be inferred from code like https://github.com/joepie91/node-cdx [21:05] I did not say anything about new implementation. [21:05] *** dashcloud has quit IRC (Read error: Operation timed out) [21:05] just dumping all info here :P [21:05] Just a better way to store the warc encoded data [21:05] arkiver: ah, hm. I noticed jake from IA converted a set of static HTML (from the BBC's website as of 1995) into WARCs using Python's simpleHTTPserver module: https://archive.org/details/bbcnc.org.uk-19950301 [21:06] Yoshimura: point is, .warc.gz is pretty much a fixed format in and of itself, in part due to the cdx files [21:06] Yoshimura: for example, the cdx contains offsets of the compressed data [21:06] using which you can read out specific records [21:07] even over a HTTP range request [21:07] un-gzipping the extracted range in the process [21:07] these are all things to take into account [21:08] My stuff was per record. [21:09] *** dashcloud has joined #archiveteam-bs [21:09] And the conversation also. [21:12] Ungzipping over http range is honorable, but it does include the warc headers, so pretty useless. Also gzip is http capability, not needed part. [21:17] Yoshimura: not sure what you mean [21:18] Gzip is used for whole record, including warc header. [21:18] But if you want to serve it as file you need to process first. [21:18] So gzip over http is half useless. [21:19] Yoshimura: it's not. it's how I can search for and extract records from an archive.org-hosted WARC file without downloading terabytes of data. [21:19] the gzip is a technical property, not a feature [21:20] I'm often concerned about how much data is duplicated in the WARCs on IA. There's probably quite a lot of stuff that is in there twice in separate WARCs [21:21] Frogging: 'duplicated' by what definition? [21:21] joepie91: Are you dreaming or ignoring? [21:21] well, say I download a site and put the WARCs on IA. Then someone else does the same, and there's now two WARCs with mostly the same data [21:22] Yoshimura [21:22] My stuff was per record. [21:22] And the conversation also. [21:22] and deflate IS per record. [21:22] I remember xmc (maybe) posting some study that showed a 50% improvement over deduplicating some WARC dataset [21:22] so it's big, but not like fatal [21:23] trying to find that link [21:23] 50% sounds rather huge [21:23] it's one dataset and I don't remember the details [21:24] k [21:24] I also regret even citing that figure becaue now people will extrapolate it to literally everything [21:24] joepie91: While your input is indeed very constructive, the fact that ignores that I did not propose compressing more files at once is not. 
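To make the range-request point concrete, here is a rough sketch of pulling one compressed record out of a remotely hosted .warc.gz and un-gzipping just that member. The URL, offset, and length are placeholder values that would really come from the CDX index documented at the links above, and it assumes the third-party requests library plus a server that honours Range:

    import zlib
    import requests

    def fetch_record(warc_url, offset, length):
        # Ask the server for just the bytes of one gzip member, then
        # decompress that single member locally.
        resp = requests.get(
            warc_url,
            headers={"Range": "bytes=%d-%d" % (offset, offset + length - 1)},
            timeout=60,
        )
        resp.raise_for_status()
        return zlib.decompress(resp.content, zlib.MAX_WBITS | 16)

    # hypothetical call; real offset/length come from a CDX lookup
    # fetch_record("https://archive.org/download/some_item/some.warc.gz", 123456, 7890)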
[21:24] *** metalcamp has quit IRC (Ping timeout: 244 seconds) [21:24] * SketchCow breaks pool cue in half [21:24] Back in an hour [21:26] yipdw_: Heh. Well, I'd be interested in running my own tests [21:26] http://fos.textfiles.com/ARCHIVETEAM/ works [21:26] lmao. that gif [21:26] What I found strange or there is some problem with regular wget is that compressing 169MB warc.gz as one file makes 150MB [21:27] Yoshimura: I'm not making any technical claims, just pointing out usecases to consider [21:27] that sounds like a reasonable gain, depending on the number of records in the warc [21:28] the reason why we don't do that is random access becomes more difficult [21:28] HTML pages only, not sure how many. Not reasonable much to me, as I remember that using deflate did save much more. [21:29] yipdw_: It was just as test. As the deflate with dictionary and gzip togehter should work kinda same way. As the dictionary is merely a precondition of sliding window [21:30] If its 20% I do not expect people to care. I just learned to handle stuff by care factor, not by real impact. [21:31] SketchCow: i'm up to 2015-10-01 with kpfa [21:31] we are very close to being complete [21:31] And if its better format that save a fck ton but is more complex in terms of tools and that it needs changes usualy argument is too much work. [21:31] So in either way stuff does not happen. [21:33] Thank youuuu [21:33] operations, migration strategies, backwards compatibility are all big deals yeah [21:34] Why is SketchCow destroying pool cues [21:34] Frogging: https://www.youtube.com/watch?v=VCXbib9MahE [21:34] yipdw_: Resources and hence money is big deal also. But seems most people are more keen at throwing more money or being able to achieve less. [21:35] I suspect in a couple of decades, if costs/gb fail to decline fast enough, IA may be a lot more interested in de-duplication; but at this point I think computing power is not sufficiently cheap (as compared with storage) to make it worth it. [21:35] lol I see :p [21:35] Still worth looking into though. [21:36] I feel like I just got told to go fuck myself [21:36] that's coo [21:38] apologies, that wasn't necessary [21:43] http://fos.textfiles.com/ARCHIVETEAM/ seems to be working. Adding "size" now. [21:51] SketchCow: Anything in particular you wanted to know or just why the uploads are going so slow? [21:53] No, no. [21:54] It's that you're uploading and using methods and I'm uploading and adding methods. [21:54] Are you being given new stuff or just doing a backfill? [21:54] SketchCow: I don't think I'm being fed anything since last thursday, so only archiving anf uploading now. [21:55] and* [21:58] Spent half an hour reading up on long-haul TCP Yesterday to see if I could push a bit more against IAs s3 bucket. Haven't had to do big data transfers over anything more than 1000km for over four years, so much of that has dropped out of my brain. [22:02] *** tomwsmf-a has joined #archiveteam-bs [22:03] Yeah, you're doing a version of what I do, so we'll have a little weirdness for keeping track of things for a while. [22:04] There's no reason/victory. You're our emergency generator, no need to line you out with needless accounting. [22:05] Minimal accounting sounds good. :-) [22:07] Your contribution is appreciated. [22:08] I could contribute cpu but not storage. [22:08] There should be some social movement to make boinc-like network using tiny VMs. [22:10] zino: What did you read on the tcp? Is there anything how you can improve? 
[22:10] If it was me I would switch to udp [22:11] OK, dude [22:11] Hold up. [22:11] BRAKES. [22:11] People discussing anything in this channel and you coming in rushing in with 'solutions' [22:11] Stop. [22:12] And if you go, "Huh, ANOTHER community unwilling to hear new ideas." [22:12] Start thinking. [22:12] Maybe it's YOU. [22:12] Is that aimed to me? [22:12] YES [22:12] Heh. [22:13] I just asked about how can I improve TCP, and did provide _personal_ (if it was me) _opinion_ not 'solutions' [22:13] [22:13] YYYYYYY YYYYYYYEEEEEEEEEEEEEEEEEEEEEE SSSSSSSSSSSSSSS [22:13] Y:::::Y Y:::::YE::::::::::::::::::::E SS:::::::::::::::S [22:13] Y:::::Y Y:::::YE::::::::::::::::::::ES:::::SSSSSS::::::S [22:13] Y::::::Y Y::::::YEE::::::EEEEEEEEE::::ES:::::S SSSSSSS [22:13] YYY:::::Y Y:::::YYY E:::::E EEEEEES:::::S [22:13] Y:::::Y Y:::::Y E:::::E S:::::S [22:13] Y:::::Y:::::Y E::::::EEEEEEEEEE S::::SSSS [22:13] Y:::::::::Y E:::::::::::::::E SS::::::SSSSS [22:13] Y:::::::Y E:::::::::::::::E SSS::::::::SS [22:13] Y:::::Y E::::::EEEEEEEEEE SSSSSS::::S [22:13] Y:::::Y E:::::E S:::::S [22:13] Y:::::Y E:::::E EEEEEE S:::::S [22:13] Y:::::Y EE::::::EEEEEEEE:::::ESSSSSSS S:::::S [22:13] YYYY:::::YYYY E::::::::::::::::::::ES::::::SSSSSS:::::S [22:13] Y:::::::::::Y E::::::::::::::::::::ES:::::::::::::::SS [22:13] YYYYYYYYYYYYY EEEEEEEEEEEEEEEEEEEEEE SSSSSSSSSSSSSSS [22:13] [22:13] [22:13] [22:13] Yoshimura: No biggie. And is Amazon had gridftp or similar set up, sure I could switch to that. Alas... :-P [22:14] *** Start has joined #archiveteam-bs [22:14] Apparently SketchCow Tripped over his brain. So tell that to him. [22:14] zino: Is there any link to what you read about TCP improvements? [22:14] Do it. [22:14] Take me up. [22:14] Take me on. [22:16] I wanted to learn about the TCP, the UDP was _personal opinion side note. No reason to start fires over that. [22:16] Yoshimura: Those tabs are closed, but it's easily googlable. You need to be able to negotiate a TCP window big enough to not still over your latence (ping). And give the kernels some more memeory and bigger queues to work with. [22:16] And the same needs to be turned on in the reciver side as well. [22:17] s/still/stall/ [22:17] Oh, great, yeah, that makes sense. Thank you ;) [22:18] *** VADemon has quit IRC (Quit: left4dead) [22:19] *** chfoo- has quit IRC (Read error: Operation timed out) [22:19] *** chfoo- has joined #archiveteam-bs [22:19] Yoshimura: you're welcome to stay here but you are not allowed to say anything for the next 12 hours [22:20] this applies to all archiveteam channels [22:22] *** wacky_ has quit IRC (Ping timeout: 244 seconds) [22:23] *** wacky has joined #archiveteam-bs [22:31] http://fos.textfiles.com/ARCHIVETEAM/ now shows sizes! [22:40] *** Fletcher_ has quit IRC (Ping timeout: 250 seconds) [22:40] *** koon has quit IRC (Ping timeout: 250 seconds) [22:40] *** espes__ has quit IRC (Ping timeout: 250 seconds) [22:40] *** espes__ has joined #archiveteam-bs [22:43] And (in theory) I just added Archivebot. [22:52] *** tomwsmf-a has quit IRC (Read error: Operation timed out) [22:57] Looks good SketchCow [22:57] Liveprogress on uploads :D [23:03] It won't show zino but that's OK. [23:03] Just another thing for people to notice. [23:03] Soon, I'm going to make it much more automatic (but also be linear.) [23:04] So only one pack-and-ship is happening at once. [23:04] With theory being it will just go relentlessly and not wait on me any further (and archivebot will go on its own.) 
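On the long-haul TCP point: the window zino describes has to cover the bandwidth-delay product, or the sender sits idle between ACKs. A back-of-the-envelope sketch with invented numbers (the real figures would be the measured uplink and the RTT to IA's S3 endpoint; on Linux the relevant knobs are window scaling and the tcp_rmem/tcp_wmem maximums on both ends):

    # Bandwidth-delay product: the TCP window must cover link rate * RTT.
    # Example numbers are invented.
    link_mbit = 500       # uplink, Mbit/s
    rtt_ms = 150          # round-trip time to the receiver, ms

    bdp_bytes = (link_mbit * 1000000 / 8) * (rtt_ms / 1000.0)
    print("window needed: %.1f MB" % (bdp_bytes / 1000000.0))
    # -> ~9.4 MB, far beyond a 64 KB default window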
[23:05] It would be nice if the page automatically refreshes when it's updated [23:05] So we can just leave it open in a monitor and watch it do stuff [23:06] What's zino? [23:06] If someone wants to write that code and hand it to me, I'll shove it in. [23:07] http://first.archival.club/ by the way [23:25] Also, now I am making one subdirectory on FOS that does nothing but these pipes. [23:25] This is an important step for the future for 1. Knowing what projects there are pipelines for (maybe run a script often to show it) and 2. Be able to write a script to say "run everything in here." [23:37] *** koon has joined #archiveteam-bs [23:54] *** tomwsmf-a has joined #archiveteam-bs [23:58] *** Stiletto has quit IRC (Read error: Operation timed out)