#archiveteam-bs 2019-01-29,Tue


Time	Nickname	Message
00:56 ^🔗		VerfiedJ has quit IRC (Quit: Leaving)
01:04 ^🔗		benjinsmi has quit IRC (Leaving)
01:22 ^🔗		Hani has quit IRC (Read error: Connection reset by peer)
01:22 ^🔗		Hani has joined #archiveteam-bs
01:50 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
01:53 ^🔗		Sk1d has joined #archiveteam-bs
02:17 ^🔗		icedice has quit IRC (Quit: Leaving)
02:19 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
02:21 ^🔗		Sk1d has joined #archiveteam-bs
02:38 ^🔗		ubahn_ has joined #archiveteam-bs
02:45 ^🔗		ubahn has quit IRC (Ping timeout: 615 seconds)
02:52 ^🔗		Wizzito has joined #archiveteam-bs
02:52 ^🔗	Wizzito	https://archive.org/details/WiiShopChannelBackup Y'all might appreciate this
02:52 ^🔗	Wizzito	Saw it and thought of archiveteam
03:02 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
03:04 ^🔗		ubahn_ has quit IRC (Quit: ubahn_)
03:06 ^🔗		Sk1d has joined #archiveteam-bs
03:51 ^🔗		newbie81 has joined #archiveteam-bs
03:52 ^🔗	newbie81	Hey all, can anyone recommend a good way of deduplicating warc files? I'm scraping a twitter feed that tends to delete lots of stuff, but my hourly cronjob is producing a lot of duplicate data
03:52 ^🔗		newbie81 is now known as jianaran
04:04 ^🔗		Wizzito has quit IRC (Quit: Leaving)
04:15 ^🔗		benjins has joined #archiveteam-bs
04:20 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
04:23 ^🔗		Sk1d has joined #archiveteam-bs
04:33 ^🔗		qw3rty117 has joined #archiveteam-bs
04:39 ^🔗		odemgi_ has joined #archiveteam-bs
04:39 ^🔗		qw3rty116 has quit IRC (Read error: Operation timed out)
04:42 ^🔗		odemgi has quit IRC (Ping timeout: 252 seconds)
04:42 ^🔗		odemg has quit IRC (Ping timeout: 265 seconds)
04:55 ^🔗		odemg has joined #archiveteam-bs
06:11 ^🔗		newbie45 has joined #archiveteam-bs
06:14 ^🔗		jianaran has quit IRC (Ping timeout: 268 seconds)
06:15 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
06:19 ^🔗		Sk1d has joined #archiveteam-bs
06:39 ^🔗		chimyatta has joined #archiveteam-bs
06:39 ^🔗		newbie45 has quit IRC (Ping timeout: 268 seconds)
06:40 ^🔗		astrid has quit IRC (Read error: Operation timed out)
06:47 ^🔗		astrid has joined #archiveteam-bs
06:48 ^🔗		svchfoo3 sets mode: +o astrid
07:07 ^🔗		wyatt8740 has joined #archiveteam-bs
07:17 ^🔗		wyatt8740 has quit IRC (Ping timeout: 255 seconds)
07:18 ^🔗		wyatt8740 has joined #archiveteam-bs
07:23 ^🔗		wyatt8740 has quit IRC (Read error: Operation timed out)
07:33 ^🔗		Exairnous has quit IRC (Ping timeout: 246 seconds)
07:43 ^🔗		VADemon has quit IRC (Read error: Connection reset by peer)
08:05 ^🔗	jodizzle	jianaran: Someone correct me if I'm wrong, but I believe there's usually a digest in the WARC header that can be used to dedupe.
08:06 ^🔗	jodizzle	WARC-Payload-Digest
08:07 ^🔗	jodizzle	In some setups it's used to write WARC revisit records
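A minimal Python sketch of the digest-based idea described above. It only separates first captures from repeats by comparing (WARC-Target-URI, WARC-Payload-Digest) pairs; it does not rewrite WARC files or emit revisit records. The function names are hypothetical, keying on the URI as well as the digest is a conservative assumption, and the optional reader assumes the third-party warcio library is installed.

```python
def split_duplicates(records):
    """Split (target_uri, payload_digest) pairs into first captures and repeats.

    `records` is any iterable of 2-tuples in capture order. A record without
    a digest is always kept, because there is nothing to compare it against.
    """
    seen = set()
    unique, duplicates = [], []
    for uri, digest in records:
        key = (uri, digest)
        if digest is not None and key in seen:
            # Same URI and same payload digest as an earlier capture.
            duplicates.append(key)
        else:
            if digest is not None:
                seen.add(key)
            unique.append(key)
    return unique, duplicates


def read_response_digests(path):
    """Yield (target_uri, payload_digest) for each response record in a WARC.

    Assumption: warcio is installed. It is imported lazily so that
    split_duplicates() stays usable without it.
    """
    from warcio.archiveiterator import ArchiveIterator

    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                headers = record.rec_headers
                yield (
                    headers.get_header("WARC-Target-URI"),
                    headers.get_header("WARC-Payload-Digest"),
                )
```

For an hourly crawl, the pairs from each new WARC could be fed through split_duplicates() together with those from earlier runs to estimate how much of the new data is repeated.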
08:16 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
08:19 ^🔗		Sk1d has joined #archiveteam-bs
08:36 ^🔗		wp494_ has joined #archiveteam-bs
08:40 ^🔗		wp494 has quit IRC (Read error: Operation timed out)
08:55 ^🔗		BlueMax has quit IRC (Read error: Connection reset by peer)
08:56 ^🔗		BlueMax has joined #archiveteam-bs
09:00 ^🔗		Coderjo_ has joined #archiveteam-bs
09:05 ^🔗		Coderjo has quit IRC (Ping timeout: 615 seconds)
09:51 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
09:55 ^🔗		Sk1d has joined #archiveteam-bs
10:14 ^🔗		BlueMax has quit IRC (Quit: Leaving)
10:29 ^🔗		atomicthu has quit IRC (No Ping reply in 180 seconds.)
10:35 ^🔗		atomicthu has joined #archiveteam-bs
11:10 ^🔗		m007a83_ has joined #archiveteam-bs
11:11 ^🔗		m007a83 has quit IRC (Read error: Operation timed out)
11:12 ^🔗		Hani111 has joined #archiveteam-bs
11:14 ^🔗		Hani has quit IRC (Read error: Operation timed out)
11:14 ^🔗		Hani111 is now known as Hani
11:46 ^🔗		Hani111 has joined #archiveteam-bs
11:48 ^🔗		Hani has quit IRC (Read error: Operation timed out)
11:57 ^🔗		Hani111 has quit IRC (Ping timeout: 615 seconds)
13:04 ^🔗		Mateon1 has quit IRC (Read error: Operation timed out)
13:04 ^🔗		Mateon1 has joined #archiveteam-bs
13:13 ^🔗		VerfiedJ has joined #archiveteam-bs
13:21 ^🔗	JAA	Yeah, you'll want to dedupe based on the digest, but I wouldn't recommend implementing that yourself unless you're very familiar with the WARC spec. There should be code for this in some of our projects, but I'm not sure how reusable it is.
13:22 ^🔗		VADemon has joined #archiveteam-bs
13:23 ^🔗	PurpleSym	crocoite has a tool (now with tests :)) which does that: https://github.com/PromyLOPh/crocoite/blob/master/crocoite/tools.py#L36
13:36 ^🔗		ubahn has joined #archiveteam-bs
14:32 ^🔗	astrid	oOoOo
14:38 ^🔗	arkiver	what is this crocoite?
14:38 ^🔗	arkiver	I wouldn´t use it for deduplication!
14:39 ^🔗	arkiver	PurpleSym: astrid: ^
14:39 ^🔗	PurpleSym	Can you elaborate?
14:39 ^🔗	PurpleSym	crocoite is chromebot.
14:41 ^🔗	arkiver	hmm
14:43 ^🔗	arkiver	actually, it might be good enough, just looked at https://github.com/webrecorder/warcio/blob/master/warcio/warcwriter.py#L169-L184
14:43 ^🔗	arkiver	initially saw a lot of WARC headers missing
14:43 ^🔗	arkiver	Do we know what is happening with the WARC-Block-Digest?
14:43 ^🔗	arkiver	or was that never saved in the records initially
14:44 ^🔗	PurpleSym	I think warcio takes care of that.
14:44 ^🔗	arkiver	hmm
14:44 ^🔗	arkiver	WARC records generated by crocoite therefore are an abstract view on the resource they represent and not necessarily the data sent over the wire. A URL fetched with HTTP/2 for example will still result in a HTTP/1.1 request/response pair in the WARC file. This may be undesirable from an archivist’s point of view (“save the data exactly like we received it”). But this level of abstraction is inevitable when dealing with more than one protocol.
14:44 ^🔗	arkiver	I think this came up reviously
14:45 ^🔗	arkiver	previously*
14:45 ^🔗	arkiver	does this stuff go into the Wayback Machine?
14:45 ^🔗	PurpleSym	I think so, https://archive.org/details/archiveteam_chromebot
14:48 ^🔗	arkiver	Do we know if brozzler has the same problem?
14:48 ^🔗	arkiver	Brozzler uses chrome too I think, but captures content differently
14:48 ^🔗	PurpleSym	Afaik brozzler uses a proxy.
14:49 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
14:49 ^🔗	PurpleSym	So, no, it does not have this “problem”.
14:51 ^🔗	arkiver	Also regarding televisiontunes, I asked, will know soon
14:51 ^🔗	PurpleSym	Thanks.
14:51 ^🔗	arkiver	What is the reason we´re currently using crocoite and not chromebot?
14:52 ^🔗	PurpleSym	brozzler, you mean?
14:52 ^🔗	arkiver	err yeah
14:52 ^🔗	arkiver	crocoite and not brozzler
14:52 ^🔗	*	arkiver was distracted
14:52 ^🔗		Sk1d has joined #archiveteam-bs
14:53 ^🔗	PurpleSym	I didn’t know about brozzler when I started this project.
14:54 ^🔗	PurpleSym	And since nobody’s spent some time setting up a brozzler IRC bot, well, …
14:54 ^🔗	arkiver	I see
14:54 ^🔗	PurpleSym	Also I believe the DOM snapshot/page screenshot crocoite creates are important.
14:55 ^🔗	PurpleSym	(I haven’t seen any other software doing that yet.)
15:03 ^🔗	arkiver	right, ok
15:03 ^🔗	arkiver	thanks
15:07 ^🔗	PurpleSym	arkiver: Do you have any thoughts on how to push IC datasheets to IA? t3’s been throwing vendor websites into archivebot, but ultimately we need them as items on IA I believe.
15:08 ^🔗	arkiver	yes, definitely want them on IA
15:08 ^🔗	arkiver	don´t we already have a datasheets collection on IA?
15:09 ^🔗	arkiver	https://archive.org/details/ic_datasheets
15:10 ^🔗	arkiver	we probably want to add them to that collection
15:10 ^🔗	PurpleSym	That’s what I was thinking. What about metadata? The description field is not very useful imo.
15:11 ^🔗	PurpleSym	Like: Vendor, markings, revision, …
15:14 ^🔗	arkiver	the description field on IA?
15:14 ^🔗	arkiver	I´d try to add as much as possible as metadata fields
15:15 ^🔗	PurpleSym	Are they indexed/searchable?
15:17 ^🔗	arkiver	like contributor, creator, language, source, publisher, licenseurl, etc.
15:17 ^🔗	arkiver	basically just as much as possible from https://archive.org/services/docs/api/metadata-schema/index.html
15:17 ^🔗	arkiver	PurpleSym: I think so, but if not we can always add stuff to the description later
15:17 ^🔗	arkiver	source would be the direct URL to the PDF
15:18 ^🔗	arkiver	and make sure it´s in the Wayback Machine, but most has already gone through archivebot as you said so that should be good
15:19 ^🔗	PurpleSym	Thanks for the link, will look into that and see how we can map available metadata onto that schema.
15:19 ^🔗	arkiver	awesome
15:19 ^🔗	arkiver	and if anything doesn´t fit, please let me know
15:19 ^🔗	arkiver	to create custom fields
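A rough Python sketch of how datasheet fields might be mapped onto Internet Archive metadata, using the `--metadata=key:value` form that also appears in the upload command quoted in this log. The helper name, the item identifier, the example URL and values, and the `markings` and `revision` fields are all hypothetical; custom fields would need to be agreed on first, as arkiver notes. The `texts` mediatype is an assumption for PDF items.

```python
def build_ia_upload_args(identifier, files, metadata):
    """Build an argument list for an `ia upload` call.

    Each metadata value may be a string or a list of strings; a list
    produces one --metadata flag per value.
    """
    args = ["ia", "upload", identifier, *files]
    for key, value in metadata.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            args.append(f"--metadata={key}:{item}")
    return args


# Hypothetical datasheet item; every value below is a placeholder.
example_metadata = {
    "collection": "ic_datasheets",  # collection linked in the log
    "mediatype": "texts",           # assumption for PDF items
    "creator": "Example Semiconductor",
    "source": "https://example.com/datasheets/part1234.pdf",  # direct PDF URL
    "subject": ["datasheet", "integrated circuit"],
    "markings": "EX1234",           # hypothetical custom field
    "revision": "B",                # hypothetical custom field
}

if __name__ == "__main__":
    print(build_ia_upload_args("example-part1234-datasheet",
                               ["part1234.pdf"], example_metadata))
```

The resulting list could be passed to `subprocess.run` once the field names have been checked against the metadata schema linked above.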
15:24 ^🔗		icedice has joined #archiveteam-bs
15:27 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
15:30 ^🔗		Sk1d has joined #archiveteam-bs
15:32 ^🔗		ubahn has quit IRC (Quit: ubahn)
15:37 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
15:40 ^🔗		Sk1d has joined #archiveteam-bs
15:53 ^🔗		Oddly has joined #archiveteam-bs
15:57 ^🔗	t3	Hey.
15:58 ^🔗	t3	arkiver: I made an account, about a month ago, to upload WARCs of the datasheets, but I think I am being throttled because I get "503 Slow Down" messages.
15:58 ^🔗	t3	I'm not exactly sure what to do.
16:00 ^🔗	t3	Currently, the easiest solution is to make WARCs of the websites that have the datasheets and then upload them to IA.
16:01 ^🔗	arkiver	t3: you can use archivebot?
16:01 ^🔗	t3	arkiver: I will take a look at your link.
16:02 ^🔗	t3	arkiver: Well, the issue is that ArchiveBot is usually busy, and I don't want to look bad by over-using it.
16:03 ^🔗	t3	Quite honestly, it would be nice to be an ops on that channel. I'm trying my best not to hoard resources.
16:03 ^🔗	arkiver	trc: please ping me if you read this
16:03 ^🔗	arkiver	t3: you can everything now too right?
16:04 ^🔗	t3	arkiver: What is trc?
16:04 ^🔗	arkiver	so we don´t usually add WARCs to the Wayback Machine that are uploaded by any users
16:04 ^🔗	arkiver	that way it would be too easy for others to get bad data in the Wayback Machin
16:04 ^🔗	arkiver	Machine*
16:04 ^🔗	arkiver	trc is/was someone here
16:05 ^🔗	t3	arkiver: Yeah, but I'm basically doing bulk uploads of WARCs using grab-site.
16:06 ^🔗	arkiver	where
16:06 ^🔗	t3	I'm hoping to be whitelisted so that the WARCs can be ingested into Wayback Machine.
16:06 ^🔗	t3	But I have over 300 GB of pending uploads, because I'm being throttled.
16:07 ^🔗	arkiver	what are the items, what is your accunt
16:07 ^🔗	arkiver	account*
16:07 ^🔗	t3	arkiver: https://archive.org/details/@warchiver
16:07 ^🔗	arkiver	I can´t promise anything about getting into the Wayback Machine, because of the reason I previously mentioned
16:07 ^🔗	t3	Yes, understood.
16:09 ^🔗	t3	I also have WARCs for websites that are not covered on the Wayback Machine.
16:11 ^🔗		astrid has left ][
16:18 ^🔗	t3	The items? Oh, well I have WARCs, basically. There are a lot of WARCs. More than 10 semiconductor websites, with more to come from the wiki page I've been working on (https://www.archiveteam.org/index.php?title=IC_datasheets). I also have many WARCs of corporate websites.
16:20 ^🔗	t3	I also have WARCs of under-covered blogs and other sites that are still not on Wayback Machine.
16:22 ^🔗		chimyatta has quit IRC (Quit: quitting)
16:24 ^🔗	t3	I currently have 205 WARCs queued for uploading.
16:29 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
16:31 ^🔗		Sk1d has joined #archiveteam-bs
16:36 ^🔗		Hani111 has joined #archiveteam-bs
16:36 ^🔗		Hani111 is now known as Hani
16:50 ^🔗		Oddly has quit IRC (Ping timeout: 255 seconds)
16:54 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
16:57 ^🔗		Sk1d has joined #archiveteam-bs
17:10 ^🔗		chimyatta has joined #archiveteam-bs
17:34 ^🔗		wp494 has joined #archiveteam-bs
17:37 ^🔗		wp494_ has quit IRC (Ping timeout: 268 seconds)
17:40 ^🔗		Despatche has joined #archiveteam-bs
18:01 ^🔗		ubahn has joined #archiveteam-bs
18:10 ^🔗		m007a83_ is now known as m007a83
18:11 ^🔗		m007a83 has quit IRC (Quit: Fuck you Comcast)
18:11 ^🔗		m007a83 has joined #archiveteam-bs
18:24 ^🔗	Fusl	is anyone here able and willing to dump the warc from here into the WBM? https://archive.org/details/www.opennicproject.org-2019-01-29-43f55a63
18:26 ^🔗		schbirid has joined #archiveteam-bs
18:34 ^🔗		Hani111 has joined #archiveteam-bs
18:34 ^🔗		Hani has quit IRC (Read error: Connection reset by peer)
18:35 ^🔗		Hani111 is now known as Hani
18:39 ^🔗	t3	Fusl: I'm just curious. How did you set up your upload script?
18:43 ^🔗		ubahn has quit IRC (Quit: ubahn)
18:48 ^🔗	arkiver	Fusl: you don´t have to upload a CDX file, btw
18:48 ^🔗	Fusl	i don't?
18:48 ^🔗	arkiver	no
18:48 ^🔗	arkiver	IA derives the WARC files and creates its own CDX file
18:48 ^🔗	arkiver	uploading only the WARC file is enough
18:48 ^🔗	Fusl	what about -meta.warc.gz?
18:49 ^🔗	arkiver	yes, that one contains valuable information
18:49 ^🔗	arkiver	upload all WARCs, no CDXs, the CDX can be directly derived from the WARC and holds no special information
18:52 ^🔗	JAA	arkiver: Curious, what does IA do when you do upload a CDX? Is it moved to a different filename or just deleted entirely?
18:52 ^🔗	arkiver	not sure about overwriting
18:53 ^🔗	arkiver	in this case a .cdx file was uploaded and IA will just ignore it
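The advice above (upload every WARC, including the -meta.warc.gz, and skip CDX files) can be sketched as a small filename filter in Python. The function name and the example filenames are hypothetical.

```python
def select_uploads(filenames):
    """Keep WARC files (including -meta.warc.gz) and drop everything else.

    CDX files are left out because, per the discussion above, IA derives
    its own CDX from the uploaded WARCs.
    """
    return [name for name in filenames if name.endswith((".warc.gz", ".warc"))]


if __name__ == "__main__":
    # Hypothetical contents of a grab-site output directory.
    listing = [
        "example.org-2019-01-29-00000.warc.gz",
        "example.org-2019-01-29-meta.warc.gz",
        "example.org-2019-01-29.cdx",
        "wpull.log",
    ]
    print(select_uploads(listing))
```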
18:54 ^🔗	Fusl	t3:
18:54 ^🔗	Fusl	id="${1}"
18:54 ^🔗	Fusl	test -d "/data/${id}" || { echo "grab-site folder with ID ${id} doesn't exist"; exit 1; }
18:54 ^🔗	Fusl	docker run -ti --rm --log-driver=none -v /data/:/data/:ro ateam/ia:latest upload "${id}" "/data/${id}/${id}-"*".warc.gz" --metadata=mediatype:web
18:55 ^🔗	Fusl	with #!/bin/bash in first line, just didn't copy that in here
18:56 ^🔗	arkiver	PurpleSym: green light for televisiontunes. please make sure to get the metadata right :)
18:56 ^🔗	Fusl	grab-site also downloads everything into /data so there's that
18:56 ^🔗	t3	Fusl: Thanks. I haven't used Docker to run grab-site before.
18:57 ^🔗	Fusl	here's my grab-site Dockerfile: http://xor.meo.ws/puJtxPYKJZxVxLL1PLoNVTF68k2wVLq7.txt
18:57 ^🔗	PurpleSym	arkiver: Unfortunately there’s not much metadata available. All I have right now is a title and the original file URL(s).
18:58 ^🔗	arkiver	ok
18:58 ^🔗	Fusl	run the grab-site web dashboard server with `docker run -p 29000:29000 --entrypoint gs-server ateam/grab-site:latest` and the actual grab-site with `docker logs -f -t $(docker run --rm -d -e GRAB_SITE_HOST=172.17.0.1 --name "grab-site_$(cat /proc/sys/kernel/random/uuid)" -v /data:/data:rw ateam/grab-site:latest --igon --import-ignores /data/ignores --no-offsite-links "${@}" | tee /dev/stderr)` and you're good to go
18:59 ^🔗	arkiver	please set the direct MP3 URL as source and add tags if available, for example http://televisiontunes.com/Olympics_-_1964_-_Tokyo_Melody.html has Sports and Olympics
19:00 ^🔗	arkiver	PurpleSym: ^
19:00 ^🔗	PurpleSym	Oh, I must’ve missed the tags while extracting metadata. I’ll fix this.
19:00 ^🔗	arkiver	yeah, looks like some don´t have tags
19:00 ^🔗	PurpleSym	Can you create a collection before I start uploading?
19:00 ^🔗	t3	Fusl: Thanks! I have been doing everything without Docker.
19:00 ^🔗	arkiver	PurpleSym: what is your email?
19:00 ^🔗	arkiver	from your account
19:01 ^🔗	t3	Fusl: To prevent cluttering this channel, can I private message you?
19:01 ^🔗	Fusl	sure
19:01 ^🔗	PurpleSym	arkiver: lars+archive@6xq.net
19:02 ^🔗	arkiver	thanks
19:20 ^🔗		ubahn has joined #archiveteam-bs
19:46 ^🔗		Oddly has joined #archiveteam-bs
19:47 ^🔗	arkiver	PurpleSym: https://archive.org/details/tvtunes you have access
20:06 ^🔗		Iridium has quit IRC (no.money.no.love)
20:09 ^🔗		Iridium has joined #archiveteam-bs
20:37 ^🔗		Exairnous has joined #archiveteam-bs
21:12 ^🔗		BlueMax has joined #archiveteam-bs
21:18 ^🔗		schbirid has quit IRC (Remote host closed the connection)
21:24 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
21:26 ^🔗		Sk1d has joined #archiveteam-bs
22:18 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
22:21 ^🔗		Sk1d has joined #archiveteam-bs
22:34 ^🔗		gandalf has joined #archiveteam-bs
23:03 ^🔗		MR9K has quit IRC (tilde lounge - https://irc.tilde.team)
23:03 ^🔗		MR9K has joined #archiveteam-bs
23:06 ^🔗		ubahn has quit IRC (Quit: ubahn)
23:14 ^🔗		ndiddy has joined #archiveteam-bs
23:16 ^🔗		tomaspark has quit IRC (Read error: Connection reset by peer)
23:55 ^🔗		Oddly has quit IRC (Ping timeout: 255 seconds)
