[00:10] hrrm. EFNet: "You have joined too many channels". well that sucks
[00:13] it is a bit interesting that this channel (-ot) has a topic that refers to bikesheds ('bs'), but the other channel is the one that ends with -bs :)
[00:42] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[01:09] *** Wingy has quit IRC (The Lounge - https://thelounge.chat)
[01:10] *** Wingy has joined #archiveteam-ot
[01:25] Has anything been done for VampireFreaks? https://vampirefreaks.com/journal_entry/8876284 - social network closing February 1st 2020
[01:26] Is the internet as fire-prone as Australia atm?
[01:35] it always is
[01:41] https://en.wikipedia.org/wiki/Vampirefreaks.com - started in 1999. I wonder how much is already on archive.org?
[01:51] *** BlueMax has joined #archiveteam-ot
[02:03] *** asdf0101 has quit IRC (The Lounge - https://thelounge.chat)
[02:03] *** markedL has quit IRC (Quit: The Lounge - https://thelounge.chat)
[02:14] Per Deathwatch, the Internet is very 'fire' prone. And it can also be prone to real fires. Some originals were lost in last year's California fires...only the copies elsewhere survived.
[02:15] *** Wingy has quit IRC (The Lounge - https://thelounge.chat)
[02:15] *** Wingy has joined #archiveteam-ot
[02:35] *** X-Scale has quit IRC (Ping timeout: 745 seconds)
[02:40] josey, added VF to the deathwatch
[02:41] if you know people in that community, maybe they can make submissions to IA via SPN of stuff they want to save.
[02:41] I saved a few journal entries via SPN
[02:50] *** LowLevelM has quit IRC (Remote host closed the connection)
[02:52] *** X-Scale has joined #archiveteam-ot
[02:57] *** LowLevelM has joined #archiveteam-ot
[03:02] Thanks atphoenix for adding it to the deathwatch. I'm not on VF, and don't know anyone who is, but I heard it was shutting down.
[03:24] *** atphoenix has quit IRC (irc.efnet.nl efnet.deic.eu)
[03:24] *** benjins has quit IRC (irc.efnet.nl efnet.deic.eu)
[03:24] *** britmob has quit IRC (irc.efnet.nl efnet.deic.eu)
[03:24] *** kiska3 has quit IRC (irc.efnet.nl efnet.deic.eu)
[03:25] *** britmob_ has joined #archiveteam-ot
[03:26] *** MrRadar2 has quit IRC (Read error: Operation timed out)
[03:31] *** benjinsmi has joined #archiveteam-ot
[03:31] *** MrRadar2 has joined #archiveteam-ot
[03:33] *** benjinss has joined #archiveteam-ot
[03:39] *** britmob_ has quit IRC (Remote host closed the connection)
[03:40] *** atphoenix has joined #archiveteam-ot
[03:41] Lol, look what related search suggestion I got from searching 'presswire J.B. Hunt Transport Services has acquired the RDI Last Mile Company, which provides home delivery services of big and bulky products, including furniture, in the northeastern U.S.' on Google,
[03:41] I got: http web mit edu /~ ecprice public wordlist ranked
[03:41] What the hell is this search suggestion? xD
[03:41] *** SoraUta has quit IRC (Read error: Connection reset by peer)
[03:41] *** SoraUta has joined #archiveteam-ot
[03:42] *** benjinsmi has quit IRC (Read error: Operation timed out)
[03:42] *** britmob has joined #archiveteam-ot
[04:07] *** kiska3 has joined #archiveteam-ot
[04:20] *** qw3rty2 has joined #archiveteam-ot
[04:29] *** qw3rty has quit IRC (Ping timeout: 745 seconds)
[04:55] *** nicolas17 has quit IRC (Quit: Konversation terminated!)
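For anyone following the 02:41 suggestion to push individual VampireFreaks pages into the Wayback Machine via Save Page Now, here is a minimal sketch in Python. It only uses the public https://web.archive.org/save/ endpoint; the URL list (beyond the journal entry linked above) and the pause between requests are illustrative assumptions, not anything from the log.

```python
# Minimal sketch of the "submissions to IA via SPN" idea from 02:41: push a
# handful of public URLs through the Wayback Machine's Save Page Now endpoint.
# The 10-second pause is an illustrative assumption, not a documented limit.
import time
import urllib.request

urls = [
    "https://vampirefreaks.com/journal_entry/8876284",  # from the log; add more here
]

for url in urls:
    req = urllib.request.Request(
        "https://web.archive.org/save/" + url,
        headers={"User-Agent": "spn-submitter/0.1"},
    )
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            # On success SPN redirects to the freshly captured snapshot.
            print(url, "->", resp.geturl())
    except Exception as exc:
        print(url, "failed:", exc)
    time.sleep(10)  # be polite to the capture service
```

Anonymous SPN use is throttled, so this is better suited to a handful of pages someone wants to keep than to saving a whole site.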
[05:06] *** odemg has quit IRC (Ping timeout: 745 seconds)
[05:11] *** odemg has joined #archiveteam-ot
[05:24] *** markedL has joined #archiveteam-ot
[05:54] *** markedL has quit IRC (Quit: The Lounge - https://thelounge.chat)
[05:55] *** marked1 has joined #archiveteam-ot
[06:31] *** asdf0101 has joined #archiveteam-ot
[07:22] *** dhyan_nat has joined #archiveteam-ot
[07:34] *** oxguy3 has joined #archiveteam-ot
[08:42] *** oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[09:20] *** VoynichCr has quit IRC (Quit: leaving)
[10:28] *** Mateon1 has quit IRC (Remote host closed the connection)
[10:28] *** Mateon1 has joined #archiveteam-ot
[10:56] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[10:57] *** BlueMax has joined #archiveteam-ot
[10:59] *** dxrt_ has quit IRC (The Lounge - https://thelounge.chat)
[11:04] *** oxguy3 has joined #archiveteam-ot
[11:17] *** oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[11:24] *** schbirid has joined #archiveteam-ot
[11:30] *** SoraUta has quit IRC (Read error: Operation timed out)
[11:39] *** dxrt_ has joined #archiveteam-ot
[11:39] *** dxrt sets mode: +o dxrt_
[11:45] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[11:56] *** dxrt_ has quit IRC (The Lounge - https://thelounge.chat)
[11:56] *** dxrt_ has joined #archiveteam-ot
[11:56] *** dxrt sets mode: +o dxrt_
[13:24] *** tuluu has quit IRC (Quit: No Ping reply in 180 seconds.)
[13:26] *** tuluu has joined #archiveteam-ot
[14:01] *** X-Scale` has joined #archiveteam-ot
[14:12] I have a wget process that's been running since November. It is using a gig of RAM on a system that can barely spare that much. It seems like I may need to kill it and start over. Is there a way to tell wget to refer to a .warc for content previously fetched, so it can spend less time talking to the actual server to catch back up?
[14:12] *** X-Scale has quit IRC (Ping timeout: 745 seconds)
[14:12] *** X-Scale` is now known as X-Scale
[15:31] *** limb has quit IRC (WeeChat 2.2)
[15:34] *** LowLevelM has quit IRC (Read error: Operation timed out)
[16:29] stuff like that is why wpull rules
[16:30] I found wpull about a month after I started this job, lol.
[16:33] been there, still forgetting to use it :}
[16:37] F
[16:39] I don't think there is such an option.
[16:45] In theory, you could extract the WARC-Target-URIs from the WARC(s) and build a --rejlist from it, but that'll fail very quickly due to command length limits, and I don't think there's a way to pass it in through a file.
[16:45] Besides, most of that memory usage probably comes from the URL table, and that would just be rebuilt anyway, so the new process would probably shoot up to roughly the same RSS pretty quickly.
[16:47] what version of wget? it could have a memory leak as well
[16:49] GNU Wget 1.19.1 built on freebsd11.0.
[16:52] I have been thinking about ways to break this job up a bit. The website in question has very distinct sections. Job A could completely ignore the URL namespace that Job B covers, for example. Job B could very efficiently get the necessary URL list programmatically.
[16:52] plus, I want to mess with wpull
[16:53] so it's probably not the worst thing to need to kill this job and then pick it up later.
[16:55] JAA that would result in an incomplete crawl, because the to-dos from the HTML in the .WARC need to be extracted and were only in RAM
[16:57] I have wondered about building an archivebot project for this crawl, but my goals aren't exactly in line with the archiveteam goals, so I would probably end up needing to build out my own archivebot instance.
[17:01] marked1: Yes, obviously the list would have to be filtered, but you could for example exclude images, videos, and stuff like that, safely.
[17:01] prq: grab-site
[17:02] I see what you mean, redo the HTML part, skip media
[17:10] prq: (your original question) That's something we touched on in #wget on freenode a couple weeks ago. You might join there and repeat your predicament to darnir over there; he's the lead dev
[17:11] If it's not something he can reshape in wget1, it's certainly something he'd want to do in wget2 (the current love child)
[17:11] *** LowLevelM has joined #archiveteam-ot
[17:12] but it does take users to voice these things to come to grips with how wget's being used
[17:12] it's technically possible, and easier in wget-lua, but IDK if it's the best use of developer time
[17:13] i personally want to see wget write an out_links file instead of storing it in RAM, at least on request via a --switch, or when RAM usage gets too high.
[17:14] but mainly so a session can be interrupted and restarted where it left off. and also so a --spider session can turn into a wget -i list.txt
[17:16] *** qw3rty2 has quit IRC (Quit: Nettalk6 - www.ntalk.de)
[17:17] *** qw3rty has joined #archiveteam-ot
[17:26] It seems unlikely to me that they'd add resumption from WARC files, since that's a very niche use and would require including a WARC parser (whereas wget is only capable of writing WARCs currently), but it can't hurt to ask.
[17:27] But yeah, storing the URL table on disk rather than in RAM is certainly something worth implementing.
[17:28] I mean, that was one of the main reasons (as far as I know) why a whole wget clone was written: wpull.
[17:31] Raccoon▸ I will pop over there for sure.
[17:44] *** qw3rty has quit IRC (Quit: Nettalk6 - www.ntalk.de)
[17:47] *** qw3rty has joined #archiveteam-ot
[18:02] there's a lot about this project I haven't figured out still. Am I able to upload a .warc to the wayback machine? Do I need to be vetted? Do I need to do this job in the archivebot if that's my goal? I know this group != archive.org
[18:02] I think this 87G .warc.gz file is a bit big, since I think I saw somewhere archive.org takes .warc files in 50GiB chunks
[18:03] and if archive.org / WBM isn't an option, should I be looking into running my own instance of a warc viewer?
[18:04] I saw an archiveteam github repo that looked like a warc deduplicator
[18:04] so maybe I ought to run my big warc through that?
[18:12] first off, what are you archiving?
[18:12] I'm archiving a large church website (not scientology)
[18:13] they tend to like to rewrite history by retroactively changing what they publish.
[18:13] and there are lots of holes in the wayback machine
[18:14] why is it so large? videos, images, text? Is everything on the open web?
[18:14] the big sections of interest are talk/sermon archives, news articles, and even canonized scripture.
[18:14] everything I'm archiving is open to the public without login required.
[18:14] there are videos, but I'm happy to skip those.
[18:14] (I think the current wget is configured to skip them)
[18:15] there are pdfs and images though.
[18:15] 87GB, do you know what percentage that represents?
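To make the 16:45-17:02 idea concrete (harvest the URLs already captured so a restarted crawl can skip media it already has, while still re-parsing the HTML for links), a minimal sketch using the warcio library is below. The input and output filenames and the text/html test are assumptions; wget itself has no option to consume such a list from a file, so the output is mainly useful as input to wpull/grab-site ignore patterns or for planning how to split the job.

```python
# Sketch of the 16:45-17:02 idea: list the WARC-Target-URI of every response
# already captured, but keep only non-HTML records (images, PDFs, video, etc.)
# in the skip list so a restarted crawl still re-fetches and re-parses HTML.
# "crawl.warc.gz" and "skip_urls.txt" are illustrative filenames.
from warcio.archiveiterator import ArchiveIterator

def already_fetched_media(warc_path):
    """Yield URIs of non-HTML response records in an existing WARC."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI")
            ctype = ""
            if record.http_headers:
                ctype = record.http_headers.get_header("Content-Type") or ""
            if uri and not ctype.lower().startswith("text/html"):
                yield uri

if __name__ == "__main__":
    with open("skip_urls.txt", "w") as out:
        for uri in already_fetched_media("crawl.warc.gz"):
            out.write(uri + "\n")
```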
[18:16] my current progress: 87GiB .gz compressed / 222GiB uncompressed, 1,719,051 requests total. I can do some analysis to try to figure out what is represented there.
[18:18] I believe it is very rare for grab donations to be loaded into WBM, I'd inquire about that first if it's an important goal
[18:19] part of why the request count is so high is they have some non-deterministic URLs that return the same content. Instead of using an anchor for a particular verse number, they'll have www.foo.com/scriptures/book/20 and www.foo.com/scriptures/book/20.1 both return chapter 20, but the latter highlights verse 1.
[18:19] part of why I did this was so that I *could* map things out and see what more intelligent grabs might look like.
[18:21] *** cerca has joined #archiveteam-ot
[18:22] My ideal outcome would definitely be to get more of this stuff into the WBM, but that's not my only possible outcome.
[19:06] prq: If you want to get it into the WBM, then yes, your account must be whitelisted for that.
[19:08] Easiest way to get it into the WBM is indeed AB, but whether or not that is a good idea for a large website is another question obviously.
[19:08] An 87 GiB WARC should be okay unless it contains millions of URLs.
[19:30] *** systwi_ has joined #archiveteam-ot
[19:36] *** systwi has quit IRC (Ping timeout: 622 seconds)
[19:37] *** qw3rty has quit IRC (Remote host closed the connection)
[19:37] *** qw3rty has joined #archiveteam-ot
[19:54] prq: other option is to create a curated archive of asset files with pretty folders and descriptors; de-websited. Upload to IA collections. Target 10 GB per 7z
[20:33] *** SoraUta has joined #archiveteam-ot
[20:33] Raccoon▸ in this case, modifying what has been published would diminish the goal-- I'm hoping to help establish what has been published over time, to help shine a light on the 1984-esque modification of the past.
[20:35] my "grab everything" mentality does not need to be the only approach either-- I could greatly reduce the amount of data by targeting specific stuff that is commonly referred to.
[20:36] I did some analysis on my .cdx file to try to determine the byte count of everything, but I may not be understanding how .cdx works properly. My text/html byte count, summed across all 200 responses, is way, way bigger than the entire uncompressed .warc
[20:44] https://pastebin.com/EsDEJva2 - I awk'd out the content-type and byte count for all the 200 status code responses listed in the .cdx, imported those two values into a .sqlite3 database, and did a SELECT type, SUM(size) ... GROUP BY type to get this report.
[21:00] *** Mateon1 has quit IRC (Remote host closed the connection)
[21:00] *** Mateon1 has joined #archiveteam-ot
[21:12] I'm watching http://dashboard.at.ninjawedding.org/?showNicks=1 and some of the jobs seem to encounter short URLs. Does archivebot feed those back into the http://urlte.am/ project?
[21:12] or does http://urlte.am/ simply stick to its brute force method?
[21:15] oh man, I have a bit of a confession-- https://www.archiveteam.org/index.php?title=Tabblo
[21:15] I am the person who had hands on keyboard who took that service offline.
[21:15] the dirty work of a junior sysadmin (back then)
[21:16] there was actually a bug in the code that prevented some of the content from being downloaded. Ned reached out to me, but I didn't have authorization to do anything to fix the bug. Something about Unicode characters.
[21:20] Is the blog post accurate in that if something broke, nobody would know how to fix it?
[21:27] ned's post? basically yeah.
[21:27] once he left, there wasn't anyone left who really knew that codebase.
[21:27] we used a couple of the components for a different service
[21:27] but if something did break and management had wanted it fixed, they could have pulled in someone to take a look and fix it.
[21:27] I was more devops/deploy/sysadmin
[21:28] but I've done a good chunk of python (mostly on the side)
[21:28] so given enough time I probably would have been able to fix some of their components.
[21:28] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[21:30] I think I ended up live-editing the django template to put the shutdown notice banner up. I don't recall how the email notifications were handled (not by me, but I do recall that something happened)
[21:31] then the final plug-pull was an nginx change, iirc.
[21:31] there was still a bunch of the data sitting on disks in the datacenter-- I left before anything was decided about those.
[21:33] the technical part of doing it was never a big deal, it was waiting for management to decide what to do. That happened several levels above me, so I had no visibility into that decision process.
[21:34] the wiki needs the shutdown notice added
[21:38] http://web.archive.org/web/20120521042216/http://www.tabblo.com:80/studio/ - http://web.archive.org/web/20120701200104/tabblo.com/
[21:39] *** qw3rty has quit IRC (Read error: Connection reset by peer)
[21:39] *** qw3rty has joined #archiveteam-ot
[21:48] the one before that. On so and so date, ....
[21:50] I remember that was a really weird day. I was in India and did it from the hotel lobby.
[22:12] *** martini has joined #archiveteam-ot
[22:30] *** nyany_ has quit IRC (Read error: Connection reset by peer)
[22:34] *** nyany_ has joined #archiveteam-ot
[22:48] *** martini has quit IRC (Ping timeout: 360 seconds)
[22:53] *** BlueMax has joined #archiveteam-ot
[23:01] *** icedice has joined #archiveteam-ot
[23:16] *** schbirid has quit IRC (Quit: Leaving)
[23:40] *** qw3rty has quit IRC (Remote host closed the connection)
[23:40] *** qw3rty has joined #archiveteam-ot
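Going back to the .cdx question from 20:36-20:44: one likely source of a per-type byte total that exceeds the uncompressed WARC is summing the wrong column. In the classic space-separated "CDX N b a m s k r M S V g" layout, S is the compressed record size and V is the byte offset into the WARC, so totalling offsets produces numbers far larger than the archive itself. The sketch below redoes the grouping in one pass under that assumed layout (the filename is hypothetical); the same sanity check could equally be applied to the awk + sqlite3 pipeline from the pastebin.

```python
# Sketch of the per-MIME-type size report from 20:44, done in one pass instead
# of awk + sqlite3. It assumes the classic "CDX N b a m s k r M S V g" layout:
# field 4 (m) is the MIME type, field 5 (s) the HTTP status, and field 9 (S)
# the *compressed* record size. Adjust the indexes if your indexer emits a
# different field order. "crawl.cdx" is an illustrative filename.
from collections import defaultdict

totals = defaultdict(int)

with open("crawl.cdx", encoding="utf-8", errors="replace") as cdx:
    for line in cdx:
        if line.startswith(" CDX"):      # header line naming the columns
            continue
        fields = line.split()
        if len(fields) < 11:
            continue
        mime, status, size = fields[3], fields[4], fields[8]
        if status == "200" and size.isdigit():
            totals[mime] += int(size)

for mime, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{total:>15,}  {mime}")
```

Because these are compressed record sizes, the per-type totals should add up to a little less than the 87 GiB .warc.gz, not more than the 222 GiB uncompressed figure.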