#archiveteam-ot 2020-01-03,Fri


Time Nickname Message
00:10 πŸ”— atphoenix hrrm. EFNet: "You have joined too many channels". well that sucks
00:13 πŸ”— atphoenix it is a bit interesting that this channel (-ot) has a topic that refers to bikesheds ('bs'), but the other channel is the one that ends with -bs :)
00:42 πŸ”— BlueMax has quit IRC (Read error: Connection reset by peer)
01:09 πŸ”— Wingy has quit IRC (The Lounge - https://thelounge.chat)
01:10 πŸ”— Wingy has joined #archiveteam-ot
01:25 πŸ”— josey Has anything been done for VampireFreaks? https://vampirefreaks.com/journal_entry/8876284 - social network closing February 1st 2020
01:26 πŸ”— kiska Is the internet as fire prone as Australia atm?
01:35 πŸ”— Frogging it always is
01:41 πŸ”— josey https://en.wikipedia.org/wiki/Vampirefreaks.com - started in 1999. I wonder how much is already on archive.org?
01:51 πŸ”— BlueMax has joined #archiveteam-ot
02:03 πŸ”— asdf0101 has quit IRC (The Lounge - https://thelounge.chat)
02:03 πŸ”— markedL has quit IRC (Quit: The Lounge - https://thelounge.chat)
02:14 πŸ”— atphoenix Per Deathwatch, the Internet is very 'fire' prone. And also can be prone to real fires. Some originals were lost in last year's California fires...only the copies elsewhere survived.
02:15 πŸ”— Wingy has quit IRC (The Lounge - https://thelounge.chat)
02:15 πŸ”— Wingy has joined #archiveteam-ot
02:35 πŸ”— X-Scale has quit IRC (Ping timeout: 745 seconds)
02:40 πŸ”— atphoenix josey, added VF to the deathwatch
02:41 πŸ”— atphoenix if you know people in that community, maybe they can make submissions to IA via SPN of stuff they want to save.
02:41 πŸ”— atphoenix I saved a few journal entries via SPN
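The Save Page Now submissions atphoenix mentions are just HTTP requests to the public capture endpoint. A minimal sketch (the helper name is invented; the journal URL is the example from this log):

```python
def spn_request_url(target):
    """Build a Save Page Now capture URL for a page.

    A GET to the returned address asks the Wayback Machine to
    archive the target page.
    """
    return "https://web.archive.org/save/" + target

# e.g. spn_request_url("https://vampirefreaks.com/journal_entry/8876284")
```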
02:50 πŸ”— LowLevelM has quit IRC (Remote host closed the connection)
02:52 πŸ”— X-Scale has joined #archiveteam-ot
02:57 πŸ”— LowLevelM has joined #archiveteam-ot
03:02 πŸ”— josey Thanks atphoenix for adding it to the deathwatch. I'm not on VF, and don't know anyone who is, but I heard it was shutting down.
03:24 πŸ”— atphoenix has quit IRC (irc.efnet.nl efnet.deic.eu)
03:24 πŸ”— benjins has quit IRC (irc.efnet.nl efnet.deic.eu)
03:24 πŸ”— britmob has quit IRC (irc.efnet.nl efnet.deic.eu)
03:24 πŸ”— kiska3 has quit IRC (irc.efnet.nl efnet.deic.eu)
03:25 πŸ”— britmob_ has joined #archiveteam-ot
03:26 πŸ”— MrRadar2 has quit IRC (Read error: Operation timed out)
03:31 πŸ”— benjinsmi has joined #archiveteam-ot
03:31 πŸ”— MrRadar2 has joined #archiveteam-ot
03:33 πŸ”— benjinss has joined #archiveteam-ot
03:39 πŸ”— britmob_ has quit IRC (Remote host closed the connection)
03:40 πŸ”— atphoenix has joined #archiveteam-ot
03:41 πŸ”— Ryz Lol, look what related search suggestion I got from searching 'presswire J.B. Hunt Transport Services has acquired the RDI Last Mile Company, which provides home delivery services of big and bulky products, including furniture, in the northeastern U.S.' on Google,
03:41 πŸ”— Ryz I got: http web mit edu /~ ecprice public wordlist ranked
03:41 πŸ”— Ryz What the hell is this search suggestion? xD
03:41 πŸ”— SoraUta has quit IRC (Read error: Connection reset by peer)
03:41 πŸ”— SoraUta has joined #archiveteam-ot
03:42 πŸ”— benjinsmi has quit IRC (Read error: Operation timed out)
03:42 πŸ”— britmob has joined #archiveteam-ot
04:07 πŸ”— kiska3 has joined #archiveteam-ot
04:20 πŸ”— qw3rty2 has joined #archiveteam-ot
04:29 πŸ”— qw3rty has quit IRC (Ping timeout: 745 seconds)
04:55 πŸ”— nicolas17 has quit IRC (Quit: Konversation terminated!)
05:06 πŸ”— odemg has quit IRC (Ping timeout: 745 seconds)
05:11 πŸ”— odemg has joined #archiveteam-ot
05:24 πŸ”— markedL has joined #archiveteam-ot
05:54 πŸ”— markedL has quit IRC (Quit: The Lounge - https://thelounge.chat)
05:55 πŸ”— marked1 has joined #archiveteam-ot
06:31 πŸ”— asdf0101 has joined #archiveteam-ot
07:22 πŸ”— dhyan_nat has joined #archiveteam-ot
07:34 πŸ”— oxguy3 has joined #archiveteam-ot
08:42 πŸ”— oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
09:20 πŸ”— VoynichCr has quit IRC (Quit: leaving)
10:28 πŸ”— Mateon1 has quit IRC (Remote host closed the connection)
10:28 πŸ”— Mateon1 has joined #archiveteam-ot
10:56 πŸ”— BlueMax has quit IRC (Read error: Connection reset by peer)
10:57 πŸ”— BlueMax has joined #archiveteam-ot
10:59 πŸ”— dxrt_ has quit IRC (The Lounge - https://thelounge.chat)
11:04 πŸ”— oxguy3 has joined #archiveteam-ot
11:17 πŸ”— oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
11:24 πŸ”— schbirid has joined #archiveteam-ot
11:30 πŸ”— SoraUta has quit IRC (Read error: Operation timed out)
11:39 πŸ”— dxrt_ has joined #archiveteam-ot
11:39 πŸ”— dxrt sets mode: +o dxrt_
11:45 πŸ”— BlueMax has quit IRC (Read error: Connection reset by peer)
11:56 πŸ”— dxrt_ has quit IRC (The Lounge - https://thelounge.chat)
11:56 πŸ”— dxrt_ has joined #archiveteam-ot
11:56 πŸ”— dxrt sets mode: +o dxrt_
13:24 πŸ”— tuluu has quit IRC (Quit: No Ping reply in 180 seconds.)
13:26 πŸ”— tuluu has joined #archiveteam-ot
14:01 πŸ”— X-Scale` has joined #archiveteam-ot
14:12 πŸ”— prq I have a wget process that's been running since november. It is using a gig of ram on a system that can barely spare that much. It seems like I may need to kill it and start over. Is there a way to tell wget to refer to a .warc for content previously fetched so it can spend less time talking to the actual server to catch back up?
14:12 πŸ”— X-Scale has quit IRC (Ping timeout: 745 seconds)
14:12 πŸ”— X-Scale` is now known as X-Scale
15:31 πŸ”— limb has quit IRC (WeeChat 2.2)
15:34 πŸ”— LowLevelM has quit IRC (Read error: Operation timed out)
16:29 πŸ”— schbirid stuff like that is why wpull rules
16:30 πŸ”— prq I found wpull about a month after I started this job, lol.
16:33 πŸ”— schbirid been there, still forgetting to use it :}
16:37 πŸ”— JAA F
16:39 πŸ”— JAA I don't think there is such an option.
16:45 πŸ”— JAA In theory, you could extract the WARC-Target-URIs from the WARC(s) and build a --rejlist from it, but that'll fail very quickly due to command length limits, and I don't think there's a way to pass it in through a file.
16:45 πŸ”— JAA Besides, most of that memory usage comes from the URL table probably, and that would just be rebuilt anyway, so the new process would probably shoot up to roughly the same RSS pretty quickly.
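The extraction step JAA suggests does not need a full WARC parser; a hypothetical sketch that just scans record headers in the gzipped WARC for WARC-Target-URI lines:

```python
import gzip

def extract_target_uris(warc_path):
    """Collect every WARC-Target-URI header from a .warc.gz file.

    Crude but sufficient for building an exclusion list of
    already-fetched URLs: it scans all lines rather than walking
    record boundaries, relying on gzip handling the concatenated
    members of a typical WARC.
    """
    uris = set()
    with gzip.open(warc_path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith("WARC-Target-URI:"):
                uris.add(line.split(":", 1)[1].strip())
    return uris
```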
16:47 πŸ”— marked1 what version of wget, it could have a memory leak as well
16:49 πŸ”— prq GNU Wget 1.19.1 built on freebsd11.0.
16:52 πŸ”— prq I have been thinking about ways to break this job up a bit. The website in question has very distinct sections. job A could completely ignore the url namespace that job B does, for example. Job B could very efficiently get the necessary URL list programmatically.
16:52 πŸ”— prq plus, I want to mess with wpull
16:53 πŸ”— prq so it's probably not the worst thing to need to kill this job and then pick it up later.
16:55 πŸ”— marked1 JAA that would result in an incomplete crawl because the todo's from the html in the .WARC need to be extracted and were only in RAM
16:57 πŸ”— prq I have wondered about building an archivebot project for this crawl, but my goals aren't exactly in line with the archiveteam goals, so I would probably end up needing to build out my own archivebot instance.
17:01 πŸ”— JAA marked1: Yes, obviously the list would have to be filtered, but you could for example exclude images, videos, and stuff like that, safely.
17:01 πŸ”— JAA prq: grab-site
17:02 πŸ”— marked1 I see what you mean, redo the HTML part, skip media
17:10 πŸ”— Raccoon prq: (your original question) That's something we touched on in #wget on freenode a couple weeks ago. You might join there and repeat your predicament to darnir over there (the lead dev)
17:11 πŸ”— Raccoon If it's not something he can reshape in wget1, it's certainly something he'd want to do in wget2 (the current love child)
17:11 πŸ”— LowLevelM has joined #archiveteam-ot
17:12 πŸ”— Raccoon but it does take users to voice these things to come to grips with how wget's being used
17:12 πŸ”— marked1 it's technically possible, and easier in wget-lua but IDK if it's the best use of developer time
17:13 πŸ”— Raccoon i personally want to see wget write an out_links file instead of storing it in ram, at least per --switch request, or when ram usage gets too high.
17:14 πŸ”— Raccoon but mainly so a session can be interrupted and restarted where it left off. and also so a --spider session can turn into a wget -i list.txt
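Raccoon's out_links idea can be approximated outside wget with two append-only text files; a hypothetical sketch of rebuilding the remaining work after an interruption (file layout invented here):

```python
import os

def resume_frontier(outlinks_path, done_path):
    """Rebuild a crawl frontier from two append-only text files.

    Assumed layout: every discovered URL is appended to
    outlinks_path and every completed fetch to done_path, so an
    interrupted session restarts with only the unfetched URLs
    (and the out_links file doubles as input for `wget -i`).
    """
    def read(path):
        if not os.path.exists(path):
            return []
        with open(path) as f:
            return [line.strip() for line in f if line.strip()]

    done = set(read(done_path))
    # preserve discovery order, skip anything already fetched
    return [u for u in read(outlinks_path) if u not in done]
```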
17:16 πŸ”— qw3rty2 has quit IRC (Quit: Nettalk6 - www.ntalk.de)
17:17 πŸ”— qw3rty has joined #archiveteam-ot
17:26 πŸ”— JAA It seems unlikely to me that they'd add resumption from WARC files since that's a very niche use and would require including a WARC parser (whereas wget is only capable of writing WARCs currently), but can't hurt to ask.
17:27 πŸ”— JAA But yeah, storing the URL table on disk rather than in RAM is certainly something worth implementing.
17:28 πŸ”— JAA I mean, that was one of the main reasons (as far as I know) why a whole wget clone was written: wpull.
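The disk-backed URL table JAA describes is roughly what wpull does with SQLite; a toy sketch of the idea (the table layout here is invented, not wpull's actual schema):

```python
import sqlite3

def open_url_table(db_path):
    """Open a disk-backed URL table: the frontier lives in a
    database file instead of process RAM."""
    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS urls (
        url TEXT PRIMARY KEY,
        status TEXT NOT NULL DEFAULT 'todo')""")
    return db

def enqueue(db, url):
    # duplicates are ignored, so the seen-set check is free
    db.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))

def next_todo(db):
    row = db.execute(
        "SELECT url FROM urls WHERE status = 'todo' LIMIT 1").fetchone()
    return row[0] if row else None

def mark_done(db, url):
    db.execute("UPDATE urls SET status = 'done' WHERE url = ?", (url,))
```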
17:31 πŸ”— prq Raccoonβ–Έ I will pop over there for sure.
17:44 πŸ”— qw3rty has quit IRC (Quit: Nettalk6 - www.ntalk.de)
17:47 πŸ”— qw3rty has joined #archiveteam-ot
18:02 πŸ”— prq there's a lot about this project I haven't figured out still. Am I able to upload a .warc to the wayback machine? do I need to be vetted? Do I need to do this job in the archivebot if that's my goal? I know this group != archive.org
18:02 πŸ”— prq I think this 87G .warc.gz file is a bit big since I think I saw somewhere archive.org takes .warc files in 50GiB chunks
18:03 πŸ”— prq and if archive.org / WBM isn't an option, should I be looking into running my own instance of a warc viewer?
18:04 πŸ”— prq I saw an archiveteam github repo that looked like a warc deduplicator
18:04 πŸ”— prq so maybe I ought to run my big warc through that?
18:12 πŸ”— marked1 first off, what are you archiving?
18:12 πŸ”— prq I'm archiving a large church website (not scientology)
18:13 πŸ”— prq they tend to like to rewrite history by retroactively changing what they publish.
18:13 πŸ”— prq and there are lots of holes in the wayback machine
18:14 πŸ”— marked1 why is it so large? videos, images, text ? Is everything on the open web?
18:14 πŸ”— prq the big sections of interest are talk/sermon archives, news articles, and even canonized scripture.
18:14 πŸ”— prq everything I'm archiving is open to the public without login required.
18:14 πŸ”— prq there are videos, but I'm happy to skip those.
18:14 πŸ”— prq (I think the current wget is configured to skip them)
18:14 πŸ”— prq there are pdfs and images though.
18:15 πŸ”— marked1 87GB, do you know what percentage that represents?
18:16 πŸ”— prq my current progress: 87GiB .gz compressed / 222GiB uncompressed, 1,719,051 requests total. I can do some analysis to try to figure out what is represented there.
18:18 πŸ”— marked1 I believe it is very rare for grab donations to be loaded into WBM, I'd inquire about that first if it's an important goal
18:19 πŸ”— prq part of why the request count is so high is they have some non-deterministic URLs that return the same content. Instead of using an anchor for a particular verse number, they'll have www.foo.com/scriptures/book/20 and www.foo.com/scriptures/book/20.1 return chapter 20, but the latter highlights verse 1.
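Variant URLs like those could be collapsed before crawling so the seen-set treats them as one page; a sketch assuming the verse-suffix scheme prq describes (the regex is a guess at that pattern, using the placeholder domain from the log):

```python
import re

# Hypothetical pattern for the scripture URLs described above:
# /scriptures/book/20 and /scriptures/book/20.1 serve the same
# chapter, the ".1" suffix only highlighting a verse.
VERSE_SUFFIX = re.compile(r"^(.*/scriptures/[^/]+/\d+)\.\d+$")

def canonicalize(url):
    """Collapse verse-highlight variants onto the chapter URL."""
    m = VERSE_SUFFIX.match(url)
    return m.group(1) if m else url
```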
18:19 πŸ”— prq part of why I did this was so that I *could* map things out and see what more intelligent grabs might look like.
18:21 πŸ”— cerca has joined #archiveteam-ot
18:22 πŸ”— prq My ideal outcome would definitely be to get more of this stuff into the WBM, but that's not my only possible outcome.
19:06 πŸ”— JAA prq: If you want to get it into the WBM, then yes, your account must be whitelisted for that.
19:08 πŸ”— JAA Easiest way to get it into the WBM is indeed AB, but whether or not that is a good idea for a large website is another question obviously.
19:08 πŸ”— JAA An 87 GiB WARC should be okay unless it contains millions of URLs.
19:30 πŸ”— systwi_ has joined #archiveteam-ot
19:36 πŸ”— systwi has quit IRC (Ping timeout: 622 seconds)
19:37 πŸ”— qw3rty has quit IRC (Remote host closed the connection)
19:37 πŸ”— qw3rty has joined #archiveteam-ot
19:54 πŸ”— Raccoon prq: other option is to create a curated archive of asset files with pretty folders and descriptors; de-websited. Upload to IA collections. Target 10 GB per 7z
20:33 πŸ”— SoraUta has joined #archiveteam-ot
20:33 πŸ”— prq Raccoonβ–Έ in this case, modifying what has been published will diminish the goal-- I'm hoping to help establish what has been published over time, to help shine a light on the 1984-esque modifying of the past.
20:35 πŸ”— prq my "grab everything" mentality does not need to be the only approach either-- I could greatly reduce the amount of data by targeting specific stuff that is commonly referred to.
20:36 πŸ”— prq I did some analysis on my .cdx file to try to determine the byte count of everything, but I may not be understanding how .cdx works properly. My text/html byte count sum across all 200 responses is way, way bigger than the entire uncompressed .warc
20:44 πŸ”— prq https://pastebin.com/EsDEJva2 - I awk'd out all the content-type and bytes for all the 200 status code responses listed in the .cdx and imported those two values into a .sqlite3 and did a select type, sum(size) group by type to get this report.
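The same per-type report can be computed straight from the CDX file; this sketch assumes the common 11-field "CDX N b a m s k r M S V g" layout, in which the length column (S) is the *compressed* record size in the WARC, a possible source of the mismatch prq saw:

```python
from collections import defaultdict

def sum_bytes_by_type(cdx_lines):
    """Sum the CDX record-length column per MIME type for 200s.

    Assumes 'CDX N b a m s k r M S V g' field order: index 3 is
    the MIME type (m), index 4 the HTTP status (s), and index 8
    the compressed record length (S).
    """
    totals = defaultdict(int)
    for line in cdx_lines:
        if not line.strip() or line.lstrip().startswith("CDX"):
            continue  # skip the header line and blanks
        fields = line.split()
        if len(fields) < 11 or fields[4] != "200":
            continue
        totals[fields[3]] += int(fields[8])
    return dict(totals)
```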
21:00 πŸ”— Mateon1 has quit IRC (Remote host closed the connection)
21:00 πŸ”— Mateon1 has joined #archiveteam-ot
21:12 πŸ”— prq I'm watching http://dashboard.at.ninjawedding.org/?showNicks=1 and some of the jobs seem to encounter short URLs. Does archivebot feed those back into the http://urlte.am/ project?
21:12 πŸ”— prq or does http://urlte.am/ simply stick to its brute force method?
21:15 πŸ”— prq oh man, I have a bit of a confession-- https://www.archiveteam.org/index.php?title=Tabblo
21:15 πŸ”— prq I am the person who had hands on keyboard who took that service offline.
21:15 πŸ”— prq the dirty work of a junior sysadmin (back then)
21:16 πŸ”— prq there was actually a bug in the code that prevented some of the content from being downloaded. Ned reached out to me, but I didn't have authorization to do anything to fix the bug. something about unicode characters.
21:20 πŸ”— marked1 Is the blog post accurate in that if something broke, nobody would know how to fix it?
21:27 πŸ”— prq ned's post? basically yeah. once he left, there wasn't anyone left who really knew that codebase.
21:27 πŸ”— prq we used a couple of the components for a different service
21:27 πŸ”— prq but if something did break and management would have wanted it fixed, they could have pulled someone who could take a look and fix it.
21:27 πŸ”— prq I was more devops/deploy/sysadmin
21:28 πŸ”— prq but I've done a good chunk of python (mostly on the side)
21:28 πŸ”— prq so given enough time I probably would have been able to fix some of their components.
21:28 πŸ”— dhyan_nat has quit IRC (Read error: Operation timed out)
21:30 πŸ”— prq I think I ended up live-editing the django template to put the shutdown notice banner up. I don't recall how the email notifications were handled (not by me, but I do recall that something happened)
21:31 πŸ”— prq then the final plug-pull was an nginx change, iirc.
21:31 πŸ”— prq there was still a bunch of the data sitting on disks in the datacenter-- I left before anything was decided about those.
21:33 πŸ”— prq the technical part of doing it was never a big deal, it was waiting for management to decide what to do. that happened several levels above me, so I had no visibility into that decision process.
21:34 πŸ”— marked1 the wiki needs the shutdown notice added
21:38 πŸ”— prq http://web.archive.org/web/20120521042216/http://www.tabblo.com:80/studio/ - http://web.archive.org/web/20120701200104/tabblo.com/
21:39 πŸ”— qw3rty has quit IRC (Read error: Connection reset by peer)
21:39 πŸ”— qw3rty has joined #archiveteam-ot
21:48 πŸ”— marked1 the one before that. On so and so date, ....
21:50 πŸ”— prq I remember that was a really weird day. I was in India and did it from the hotel lobby.
22:12 πŸ”— martini has joined #archiveteam-ot
22:30 πŸ”— nyany_ has quit IRC (Read error: Connection reset by peer)
22:34 πŸ”— nyany_ has joined #archiveteam-ot
22:48 πŸ”— martini has quit IRC (Ping timeout: 360 seconds)
22:53 πŸ”— BlueMax has joined #archiveteam-ot
23:01 πŸ”— icedice has joined #archiveteam-ot
23:16 πŸ”— schbirid has quit IRC (Quit: Leaving)
23:40 πŸ”— qw3rty has quit IRC (Remote host closed the connection)
23:40 πŸ”— qw3rty has joined #archiveteam-ot
