Time | Nickname | Message
00:10 | atphoenix | hrrm. EFNet: "You have joined too many channels". well that sucks
00:13 | atphoenix | it is a bit interesting that this channel (-ot) has a topic that refers to bikesheds ('bs'), but the other channel is the one that ends with -bs :)
00:42 | | BlueMax has quit IRC (Read error: Connection reset by peer)
01:09 | | Wingy has quit IRC (The Lounge - https://thelounge.chat)
01:10 | | Wingy has joined #archiveteam-ot
01:25 | josey | Has anything been done for VampireFreaks? https://vampirefreaks.com/journal_entry/8876284 - social network closing February 1st 2020
01:26 | kiska | Is the internet as fire prone as Australia atm?
01:35 | Frogging | it always is
01:41 | josey | https://en.wikipedia.org/wiki/Vampirefreaks.com - started in 1999. I wonder how much is already on archive.org?
01:51 | | BlueMax has joined #archiveteam-ot
02:03 | | asdf0101 has quit IRC (The Lounge - https://thelounge.chat)
02:03 | | markedL has quit IRC (Quit: The Lounge - https://thelounge.chat)
02:14 | atphoenix | Per Deathwatch, the Internet is very 'fire' prone. And also can be prone to real fires. Some originals were lost in last year's California fires...only the copies elsewhere survived.
02:15 | | Wingy has quit IRC (The Lounge - https://thelounge.chat)
02:15 | | Wingy has joined #archiveteam-ot
02:35 | | X-Scale has quit IRC (Ping timeout: 745 seconds)
02:40 | atphoenix | josey, added VF to the deathwatch
02:41 | atphoenix | if you know people in that community, maybe they can make submissions to IA via SPN of stuff they want to save.
02:41 | atphoenix | I saved a few journal entries via SPN
02:50 | | LowLevelM has quit IRC (Remote host closed the connection)
02:52 | | X-Scale has joined #archiveteam-ot
02:57 | | LowLevelM has joined #archiveteam-ot
03:02 | josey | Thanks atphoenix for adding it to the deathwatch. I'm not on VF, and don't know anyone who is, but I heard it was shutting down.
03:24 | | atphoenix has quit IRC (irc.efnet.nl efnet.deic.eu)
03:24 | | benjins has quit IRC (irc.efnet.nl efnet.deic.eu)
03:24 | | britmob has quit IRC (irc.efnet.nl efnet.deic.eu)
03:24 | | kiska3 has quit IRC (irc.efnet.nl efnet.deic.eu)
03:25 | | britmob_ has joined #archiveteam-ot
03:26 | | MrRadar2 has quit IRC (Read error: Operation timed out)
03:31 | | benjinsmi has joined #archiveteam-ot
03:31 | | MrRadar2 has joined #archiveteam-ot
03:33 | | benjinss has joined #archiveteam-ot
03:39 | | britmob_ has quit IRC (Remote host closed the connection)
03:40 | | atphoenix has joined #archiveteam-ot
03:41 | Ryz | Lol, look what related search suggestion I got from searching 'presswire J.B. Hunt Transport Services has acquired the RDI Last Mile Company, which provides home delivery services of big and bulky products, including furniture, in the northeastern U.S.' on Google,
03:41 | Ryz | I got: http web mit edu /~ ecprice public wordlist ranked
03:41 | Ryz | What the hell is this search suggestion? xD
03:41 | | SoraUta has quit IRC (Read error: Connection reset by peer)
03:41 | | SoraUta has joined #archiveteam-ot
03:42 | | benjinsmi has quit IRC (Read error: Operation timed out)
03:42 | | britmob has joined #archiveteam-ot
04:07 | | kiska3 has joined #archiveteam-ot
04:20 | | qw3rty2 has joined #archiveteam-ot
04:29 | | qw3rty has quit IRC (Ping timeout: 745 seconds)
04:55 | | nicolas17 has quit IRC (Quit: Konversation terminated!)
05:06 | | odemg has quit IRC (Ping timeout: 745 seconds)
05:11 | | odemg has joined #archiveteam-ot
05:24 | | markedL has joined #archiveteam-ot
05:54 | | markedL has quit IRC (Quit: The Lounge - https://thelounge.chat)
05:55 | | marked1 has joined #archiveteam-ot
06:31 | | asdf0101 has joined #archiveteam-ot
07:22 | | dhyan_nat has joined #archiveteam-ot
07:34 | | oxguy3 has joined #archiveteam-ot
08:42 | | oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
09:20 | | VoynichCr has quit IRC (Quit: leaving)
10:28 | | Mateon1 has quit IRC (Remote host closed the connection)
10:28 | | Mateon1 has joined #archiveteam-ot
10:56 | | BlueMax has quit IRC (Read error: Connection reset by peer)
10:57 | | BlueMax has joined #archiveteam-ot
10:59 | | dxrt_ has quit IRC (The Lounge - https://thelounge.chat)
11:04 | | oxguy3 has joined #archiveteam-ot
11:17 | | oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
11:24 | | schbirid has joined #archiveteam-ot
11:30 | | SoraUta has quit IRC (Read error: Operation timed out)
11:39 | | dxrt_ has joined #archiveteam-ot
11:39 | | dxrt sets mode: +o dxrt_
11:45 | | BlueMax has quit IRC (Read error: Connection reset by peer)
11:56 | | dxrt_ has quit IRC (The Lounge - https://thelounge.chat)
11:56 | | dxrt_ has joined #archiveteam-ot
11:56 | | dxrt sets mode: +o dxrt_
13:24 | | tuluu has quit IRC (Quit: No Ping reply in 180 seconds.)
13:26 | | tuluu has joined #archiveteam-ot
14:01 | | X-Scale` has joined #archiveteam-ot
14:12 | prq | I have a wget process that's been running since november. It is using a gig of ram on a system that can barely spare that much. It seems like I may need to kill it and start over. Is there a way to tell wget to refer to a .warc for content previously fetched so it can spend less time talking to the actual server to catch back up?
14:12 | | X-Scale has quit IRC (Ping timeout: 745 seconds)
14:12 | | X-Scale` is now known as X-Scale
15:31 | | limb has quit IRC (WeeChat 2.2)
15:34 | | LowLevelM has quit IRC (Read error: Operation timed out)
16:29 | schbirid | stuff like that is why wpull rules
16:30 | prq | I found wpull about a month after I started this job, lol.
16:33 | schbirid | been there, still forgetting to use it :}
16:37 | JAA | F
16:39 | JAA | I don't think there is such an option.
16:45 | JAA | In theory, you could extract the WARC-Target-URIs from the WARC(s) and build a --rejlist from it, but that'll fail very quickly due to command length limits, and I don't think there's a way to pass it in through a file.
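
A rough sketch of that extraction step, using the warcio library (the input filename and suffix list are placeholders): pull the WARC-Target-URI of every response record, keeping only media-type URLs, since, as noted a few lines below, the HTML pages would still need to be re-fetched for link extraction.

    # Sketch: list already-fetched media URLs from an existing WARC so they
    # could be excluded from a fresh crawl. Assumes the warcio package;
    # 'crawl.warc.gz' and the suffix list are placeholders.
    from warcio.archiveiterator import ArchiveIterator

    MEDIA_SUFFIXES = ('.jpg', '.jpeg', '.png', '.gif', '.mp4', '.pdf')

    with open('crawl.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            uri = record.rec_headers.get_header('WARC-Target-URI')
            if uri and uri.lower().endswith(MEDIA_SUFFIXES):
                print(uri)  # feed this list to whatever exclusion mechanism is available
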
16:45 | JAA | Besides, most of that memory usage comes from the URL table probably, and that would just be rebuilt anyway, so the new process would probably shoot up to roughly the same RSS pretty quickly.
16:47 | marked1 | what version of wget, it could have a memory leak as well
16:49 | prq | GNU Wget 1.19.1 built on freebsd11.0.
16:52 | prq | I have been thinking about ways to break this job up a bit. The website in question has very distinct sections. job A could completely ignore the url namespace that job B does, for example. Job B could very efficiently get the necessary URL list programmatically.
16:52 | prq | plus, I want to mess with wpull
16:53 | prq | so it's probably not the worst thing to need to kill this job and then pick it up later.
16:55 | marked1 | JAA that would result in an incomplete crawl because the todo's from the html in the .WARC need to be extracted and were only in RAM
16:57 | prq | I have wondered about building an archivebot project for this crawl, but my goals aren't exactly in line with the archiveteam goals, so I would probably end up needing to build out my own archivebot instance.
17:01 | JAA | marked1: Yes, obviously the list would have to be filtered, but you could for example exclude images, videos, and stuff like that, safely.
17:01 | JAA | prq: grab-site
17:02 | marked1 | I see what you mean, redo the HTML part, skip media
17:10 | Raccoon | prq: (your original question) That's something we touched on in #wget on freenode a couple weeks ago. You might join there and repeat your predicament to darnir over there. lead dev
17:11 | Raccoon | If it's not something he can reshape in wget1, it's certainly something he'd want to do in wget2 (the current love child)
17:11 | | LowLevelM has joined #archiveteam-ot
17:12 | Raccoon | but it does take users to voice these things to come to grips with how wget's being used
17:12 | marked1 | it's technically possible, and easier in wget-lua but IDK if it's the best use of developer time
17:13 | Raccoon | i personally want to see wget write an out_links file instead of storing it in ram, at least per --switch request, or when ram usage gets too high.
17:14 | Raccoon | but mainly so a session can be interrupted and restarted where it left off. and also so a --spider session can turn into a wget -i list.txt
17:16 | | qw3rty2 has quit IRC (Quit: Nettalk6 - www.ntalk.de)
17:17 | | qw3rty has joined #archiveteam-ot
17:26 | JAA | It seems unlikely to me that they'd add resumption from WARC files since that's a very niche use and would require including a WARC parser (whereas wget is only capable of writing WARCs currently), but can't hurt to ask.
17:27 | JAA | But yeah, storing the URL table on disk rather than in RAM is certainly something worth implementing.
17:28 | JAA | I mean, that was one of the main reasons (as far as I know) why a whole wget clone was written: wpull.
17:31 | prq | Raccoon: I will pop over there for sure.
17:44 | | qw3rty has quit IRC (Quit: Nettalk6 - www.ntalk.de)
17:47 | | qw3rty has joined #archiveteam-ot
18:02 | prq | there's a lot about this project I haven't figured out still. Am I able to upload a .warc to the wayback machine? do I need to be vetted? Do I need to do this job in the archivebot if that's my goal? I know this group != archive.org
18:02 | prq | I think this 87G .warc.gz file is a bit big since I think I saw somewhere archive.org takes .warc files in 50GiB chunks
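
If chunking did turn out to be necessary, one rough way to do it (a sketch only; the filenames and the 40 GiB threshold are made up, and the warcio library is assumed) is to copy records into numbered output WARCs and roll over to a new file once a size threshold is passed, so splits only ever happen at record boundaries.

    # Sketch: re-chunk a large .warc.gz into smaller .warc.gz files at record
    # boundaries using warcio. Filenames and the size limit are placeholders.
    from warcio.archiveiterator import ArchiveIterator
    from warcio.warcwriter import WARCWriter

    LIMIT = 40 * 1024**3                                 # stay under the 50 GiB figure mentioned above
    part = 1
    out = open('site-part-%02d.warc.gz' % part, 'wb')    # placeholder output name
    writer = WARCWriter(out, gzip=True)

    with open('site.warc.gz', 'rb') as stream:           # placeholder for the big 87G file
        for record in ArchiveIterator(stream):
            writer.write_record(record)
            if out.tell() > LIMIT:                       # threshold crossed: start a new part
                out.close()
                part += 1
                out = open('site-part-%02d.warc.gz' % part, 'wb')
                writer = WARCWriter(out, gzip=True)
    out.close()
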
18:03 | prq | and if archive.org / WBM isn't an option, should I be looking into running my own instance of a warc viewer?
18:04 | prq | I saw an archiveteam github repo that looked like a warc deduplicator
18:04 | prq | so maybe I ought to run my big warc through that?
18:12 | marked1 | first off, what are you archiving?
18:12 | prq | I'm archiving a large church website (not scientology)
18:13 | prq | they tend to like to rewrite history by retroactively changing what they publish.
18:13 | prq | and there are lots of holes in the wayback machine
18:14 | marked1 | why is it so large? videos, images, text? Is everything on the open web?
18:14 | prq | the big sections of interest are talk/sermon archives, news articles, and even canonized scripture.
18:14 | prq | everything I'm archiving is open to the public without login required.
18:14 | prq | there are videos, but I'm happy to skip those.
18:14 | prq | (I think the current wget is configured to skip them)
18:15 | prq | there are pdfs and images though.
18:16 | marked1 | 87GB, do you know what percentage that represents?
18:18 | prq | my current progress: 87GiB .gz compressed / 222GiB uncompressed, 1,719,051 requests total. I can do some analysis to try to figure out what is represented there.
18:19 | marked1 | I believe it is very rare for grab donations to be loaded into WBM, I'd inquire about that first if it's an important goal
18:19 | prq | part of why the request count is so high is they have some non-deterministic URLs that return the same content. Instead of using an anchor for a particular verse number, they'll have www.foo.com/scriptures/book/20 and www.foo.com/scriptures/book/20.1 return chapter 20, but the latter highlights verse 1.
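
For a future, more targeted crawl, duplicates like that could be collapsed before queueing. A minimal sketch, assuming the /scriptures/book/20 vs. /scriptures/book/20.1 pattern described above (the real site's URL scheme may well differ):

    # Sketch: collapse ".../book/20.1", ".../book/20.7", etc. onto ".../book/20"
    # so each chapter is queued only once. The regex is based on the example
    # URLs above and would need adjusting for the real site.
    import re

    VERSE_SUFFIX = re.compile(r'^(/scriptures/.+/\d+)\.\d+$')
    seen = set()

    def canonicalize(url_path: str) -> str:
        m = VERSE_SUFFIX.match(url_path)
        return m.group(1) if m else url_path

    def should_fetch(url_path: str) -> bool:
        key = canonicalize(url_path)
        if key in seen:
            return False
        seen.add(key)
        return True

    # e.g. should_fetch('/scriptures/book/20')   -> True
    #      should_fetch('/scriptures/book/20.1') -> False (same chapter)
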
18:19
π
|
prq |
part of why I did this was so that I *could* map things out and see what more intelligent grabs might look like. |
18:21
π
|
|
cerca has joined #archiveteam-ot |
18:22
π
|
prq |
My ideal outcome would definitely be to get more of this stuff into the WBM, but that's not my only possible outcome. |
19:06
π
|
JAA |
prq: If you want to get it into the WBM, then yes, your account must be whitelisted for that. |
19:08
π
|
JAA |
Easiest way to get it into the WBM is indeed AB, but whether or not that is a good idea for a large website is another question obviously. |
19:08
π
|
JAA |
An 87 GiB WARC should be okay unless it contains millions of URLs. |
19:30
π
|
|
systwi_ has joined #archiveteam-ot |
19:36
π
|
|
systwi has quit IRC (Ping timeout: 622 seconds) |
19:37
π
|
|
qw3rty has quit IRC (Remote host closed the connection) |
19:37
π
|
|
qw3rty has joined #archiveteam-ot |
19:54
π
|
Raccoon |
prq: other option is to create a curated archive of asset files with pretty folders and descriptors; de-websited. Upload to IA collections. Target 10 GB per 7z |
20:33
π
|
|
SoraUta has joined #archiveteam-ot |
20:33
π
|
prq |
RaccoonβΈ in this case, modifying what has been published will diminish the goal-- I'm hoping to help establish what has been published in over time to help shine a light on the 1984-esque modifying the past. |
20:35
π
|
prq |
my "grab everything" mentality does not need to be the only approach either-- I could greatly reduce the amount of data by targeting specific stuff that is commonly referred to. |
20:36
π
|
prq |
I did some analysis on my .cdx file to try to determine the byte count of everything, but I may not be understanding how .cdx works properly. My text/html byte count sum across all 200 responses are way way bigger than the entire uncompressed .warc |
20:44
π
|
prq |
https://pastebin.com/EsDEJva2 - I awk'd out all the content-type and bytes for all the 200 status code responses listed in the .cdx and imported those two values into a .sqlite3 and did a select type sum(size) group by type to get this report. |
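
The same per-type tally can be done directly from the CDX in a few lines, which also makes it easy to double-check which column is being summed; note that the size field's meaning depends on the CDX legend line (in the common layout it is typically the record's length as stored in the WARC, not the uncompressed body size), so picking the wrong column could account for surprising totals. A sketch, with 'index.cdx' as a placeholder name:

    # Sketch: sum response sizes per MIME type straight from a CDX file.
    # Column positions are read from the legend line (e.g. " CDX N b a m s k r M S V g")
    # rather than hard-coded, since field order varies between tools.
    from collections import defaultdict

    totals = defaultdict(int)
    with open('index.cdx', encoding='utf-8', errors='replace') as f:
        fields = f.readline().split()[1:]   # drop the leading "CDX" token
        mime = fields.index('m')            # MIME type of original document
        status = fields.index('s')          # HTTP response code
        size = fields.index('S')            # record length field
        for line in f:
            cols = line.split()
            if len(cols) <= max(mime, status, size):
                continue
            if cols[status] == '200':
                try:
                    totals[cols[mime]] += int(cols[size])
                except ValueError:
                    pass                    # skip '-' placeholders

    for mime_type, total in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f'{total:>15d}  {mime_type}')
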
21:00 | | Mateon1 has quit IRC (Remote host closed the connection)
21:00 | | Mateon1 has joined #archiveteam-ot
21:12 | prq | I'm watching http://dashboard.at.ninjawedding.org/?showNicks=1 and some of the jobs seem to encounter short URLs. Does archivebot feed those back into the http://urlte.am/ project?
21:12 | prq | or does http://urlte.am/ simply stick to its brute force method?
21:15 | prq | oh man, I have a bit of a confession-- https://www.archiveteam.org/index.php?title=Tabblo
21:15 | prq | I am the person who had hands on keyboard who took that service offline.
21:15 | prq | the dirty work of a junior sysadmin (back then)
21:16 | prq | there was actually a bug in the code that prevented some of the content from being downloaded. Ned reached out to me, but I didn't have authorization to do anything to fix the bug. something about unicode characters.
21:20 | marked1 | Is the blog post accurate in that if something broke, nobody would know how to fix it?
21:27 | prq | ned's post? basically yeah. once he left, there wasn't anyone left who really knew that codebase.
21:27 | prq | we used a couple of the components for a different service
21:28 | prq | but if something did break and management would have wanted it fixed, they could have pulled someone who could take a look and fix it.
21:28 | prq | I was more devops/deploy/sysadmin
21:28 | prq | but I've done a good chunk of python (mostly on the side)
21:28 | prq | so given enough time I probably would have been able to fix some of their components.
21:30 | | dhyan_nat has quit IRC (Read error: Operation timed out)
21:31 | prq | I think I ended up live-editing the django template to put the shutdown notice banner up. I don't recall how the email notifications were handled (not by me, but I do recall that something happened)
21:31 | prq | then the final plug-pull was an nginx change, iirc.
21:33 | prq | there was still a bunch of the data sitting on disks in the datacenter-- I left before anything was decided about those.
21:34 | prq | the technical part of doing it was never a big deal, it was waiting for management to decide what to do. that happened several levels above me, so I had no visibility into that decision process.
21:38 | marked1 | the wiki needs the shutdown notice added
21:39 | prq | http://web.archive.org/web/20120521042216/http://www.tabblo.com:80/studio/ - http://web.archive.org/web/20120701200104/tabblo.com/
21:39 | | qw3rty has quit IRC (Read error: Connection reset by peer)
21:39 | | qw3rty has joined #archiveteam-ot
21:48 | marked1 | the one before that. On so and so date, ....
21:50 | prq | I remember that was a really weird day. I was in India and did it from the hotel lobby.
22:12 | | martini has joined #archiveteam-ot
22:30 | | nyany_ has quit IRC (Read error: Connection reset by peer)
22:34 | | nyany_ has joined #archiveteam-ot
22:48 | | martini has quit IRC (Ping timeout: 360 seconds)
22:53 | | BlueMax has joined #archiveteam-ot
23:01 | | icedice has joined #archiveteam-ot
23:16 | | schbirid has quit IRC (Quit: Leaving)
23:40 | | qw3rty has quit IRC (Remote host closed the connection)
23:40 | | qw3rty has joined #archiveteam-ot